| @@ -0,0 +1,21 @@ | |||
| MIT License | |||
| Copyright (c) 2017 Jiaxuan You, Rex Ying | |||
| Permission is hereby granted, free of charge, to any person obtaining a copy | |||
| of this software and associated documentation files (the "Software"), to deal | |||
| in the Software without restriction, including without limitation the rights | |||
| to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |||
| copies of the Software, and to permit persons to whom the Software is | |||
| furnished to do so, subject to the following conditions: | |||
| The above copyright notice and this permission notice shall be included in all | |||
| copies or substantial portions of the Software. | |||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |||
| FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |||
| AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |||
| LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |||
| OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | |||
| SOFTWARE. | |||
| @@ -0,0 +1,86 @@ | |||
# GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models
This repository is the official PyTorch implementation of GraphRNN, a deep auto-regressive generative model for graphs.
[Jiaxuan You](https://cs.stanford.edu/~jiaxuan/)\*, [Rex Ying](https://cs.stanford.edu/people/rexy/)\*, [Xiang Ren](http://www-bcf.usc.edu/~xiangren/), [William L. Hamilton](https://stanford.edu/~wleif/), [Jure Leskovec](https://cs.stanford.edu/people/jure/index.html), [GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models](https://arxiv.org/abs/1802.08773) (ICML 2018)
| ## Installation | |||
Install PyTorch following the instructions on the [official website](https://pytorch.org/). The code has been tested with PyTorch 0.2.0 and 0.4.0.
| ```bash | |||
| conda install pytorch torchvision cuda90 -c pytorch | |||
| ``` | |||
| Then install the other dependencies. | |||
| ```bash | |||
| pip install -r requirements.txt | |||
| ``` | |||
| ## Test run | |||
| ```bash | |||
| python main.py | |||
| ``` | |||
| ## Code description | |||
| For the GraphRNN model: | |||
| `main.py` is the main executable file, and specific arguments are set in `args.py`. | |||
`train.py` contains the training loop and calls `model.py` and `data.py`.
| `create_graphs.py` is where we prepare target graph datasets. | |||
| For baseline models: | |||
* The B-A (Barabási–Albert) and E-R (Erdős–Rényi) models are implemented in `baselines/baseline_simple.py`.
* The [Kronecker graph model](https://cs.stanford.edu/~jure/pubs/kronecker-jmlr10.pdf) is implemented in the SNAP software package; see `https://github.com/snap-stanford/snap/tree/master/examples/krongen` (for generating Kronecker graphs) and `https://github.com/snap-stanford/snap/tree/master/examples/kronfit` (for learning the model parameters).
* MMSB is implemented using the Edward library (http://edwardlib.org/) and is located in `baselines`.
* We implemented the DeepGMG model in `main_DeepGMG.py`, following the description in its [paper](https://arxiv.org/abs/1803.03324).
* We implemented the GraphVAE model in `baselines/graphvae`, following the description in its [paper](https://arxiv.org/abs/1802.03480).
| Parameter setting: | |||
To adjust the hyperparameters and input arguments of the model, modify the fields of `args.py`
accordingly.
| For example, `args.cuda` controls which GPU is used to train the model, and `args.graph_type` | |||
| specifies which dataset is used to train the generative model. See the documentation in `args.py` | |||
| for more detailed descriptions of all fields. | |||
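As an illustration, a minimal edit inside `Args.__init__` in `args.py` might look like the sketch below; the values shown (GPU 0, the `grid` dataset, the `GraphRNN_RNN` variant) are only illustrative choices among the options listed in that file:
```python
# inside Args.__init__ in args.py -- illustrative values only
self.cuda = 0                  # which CUDA GPU device to train on
self.graph_type = 'grid'       # which dataset to train the generative model on
self.note = 'GraphRNN_RNN'     # which GraphRNN variant to use
```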
| ## Outputs | |||
| There are several different types of outputs, each saved into a different directory under a path prefix. The path prefix is set at `args.dir_input`. Suppose that this field is set to `./`: | |||
* `./graphs` contains the pickle files of training, test and generated graphs. Each pickle file contains a list
of networkx objects.
| * `./eval_results` contains the evaluation of MMD scores in txt format. | |||
* `./model_save` stores the model checkpoints.
| * `./nll` saves the log-likelihood for generated graphs as sequences. | |||
| * `./figures` is used to save visualizations (see Visualization of graphs section). | |||
| ## Evaluation | |||
The evaluation is done in `evaluate.py`, where the user can choose which settings to evaluate.
To evaluate how close the generated graphs are to the ground truth set, we use MMD (maximum mean discrepancy) to calculate the divergence between two _sets of distributions_ derived from
the ground truth and generated graphs.
Two types of distributions are chosen: the degree distribution and the clustering coefficient distribution.
Both are implemented in `eval/stats.py`, using the Python `multiprocessing`
module. One can easily extend the evaluation to compute MMD for other graph statistics.
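To illustrate the idea (this is a simplified sketch, not the exact kernel or estimator used in `eval/stats.py`), an MMD estimate between two sets of normalized degree histograms with a plain Gaussian kernel could look like the following; the kernel choice, bandwidth, and histogram length are assumptions made for the example:
```python
import networkx as nx
import numpy as np

def degree_histogram(G, max_degree=50):
    # normalized degree histogram, padded/truncated to a fixed length
    hist = np.array(nx.degree_histogram(G), dtype=float)[:max_degree]
    hist = np.pad(hist, (0, max_degree - len(hist)), 'constant')
    return hist / max(hist.sum(), 1.0)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(set1, set2, kernel=gaussian_kernel):
    # biased MMD^2 estimate between two sets of histograms
    k_xx = np.mean([kernel(x, y) for x in set1 for y in set1])
    k_yy = np.mean([kernel(x, y) for x in set2 for y in set2])
    k_xy = np.mean([kernel(x, y) for x in set1 for y in set2])
    return k_xx + k_yy - 2 * k_xy

# example: compare the degree statistics of two small graph sets
real = [nx.grid_2d_graph(10, 10) for _ in range(8)]
pred = [nx.barabasi_albert_graph(100, 2) for _ in range(8)]
print(mmd_squared([degree_histogram(g) for g in real],
                  [degree_histogram(g) for g in pred]))
```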
| We also compute the orbit counts for each graph, represented as a high-dimensional data point. We then compute the MMD | |||
| between the two _sets of sampled points_ using ORCA (see http://www.biolab.si/supp/orca/orca.html) at `eval/orca`. | |||
One first needs to compile ORCA by running
```bash
g++ -O2 -std=c++11 -o orca orca.cpp
```
in the directory `eval/orca`
(the precompiled binary included in the repository works on Ubuntu).
| To evaluate, run | |||
| ```bash | |||
| python evaluate.py | |||
| ``` | |||
Arguments specific to evaluation are specified in the class
`evaluate.Args_evaluate`. Note that the field `Args_evaluate.dataset_name_all` must only contain
datasets for which a model has already been trained, i.e., by setting `args.graph_type` to each of those datasets and running
`python main.py`.
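For example, assuming a model has already been trained for each dataset listed in `Args_evaluate.dataset_name_all`, the overall workflow is:
```bash
# 1. train a model for the chosen dataset (set args.graph_type in args.py first)
python main.py
# 2. compute the MMD statistics for the trained models and generated graphs
python evaluate.py
```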
| ## Visualization of graphs | |||
The training, testing and generated graphs are saved at `graphs/`.
One can visualize the generated graphs using the function `utils.load_graph_list`, which loads the
list of graphs from the pickle file, and `utils.draw_graph_list`, which plots the graphs using
networkx.
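A minimal sketch of this, assuming both helpers live in `utils.py` and using the `draw_graph_list(graph_list, row, col, fname)` signature that appears elsewhere in this repository (the pickle file name below is only an example; actual names depend on the arguments and epoch):
```python
from utils import load_graph_list, draw_graph_list

# load a list of networkx graphs from a generated-graph pickle under graphs/
graphs = load_graph_list('graphs/GraphRNN_RNN_grid_4_128_pred_3000_1.dat')
# plot the first 16 graphs in a 4x4 grid and save the figure under figures/
draw_graph_list(graphs[:16], row=4, col=4, fname='figures/grid_pred_3000')
```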
| ## Misc | |||
Jesse Bettencourt and Harris Chan have made great [slides](https://duvenaud.github.io/learn-discrete/slides/graphrnn.pdf) introducing GraphRNN in Prof. David Duvenaud’s seminar course [Learning Discrete Latent Structure](https://duvenaud.github.io/learn-discrete/).
| @@ -0,0 +1,216 @@ | |||
| # this file is used to plot images | |||
| from main import * | |||
| args = Args() | |||
| print(args.graph_type, args.note) | |||
| # epoch = 16000 | |||
| epoch = 3000 | |||
| sample_time = 3 | |||
| def find_nearest_idx(array,value): | |||
| idx = (np.abs(array-value)).argmin() | |||
| return idx | |||
| # for baseline model | |||
| for num_layers in range(4,5): | |||
| # give file name and figure name | |||
| fname_real = args.graph_save_path + args.fname_real + str(0) | |||
| fname_pred = args.graph_save_path + args.fname_pred + str(epoch) +'_'+str(sample_time) | |||
| figname = args.figure_save_path + args.fname + str(epoch) +'_'+str(sample_time) | |||
| # fname_real = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + '_real_' + str(True) + '_' + str(num_layers) | |||
| # fname_pred = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + '_pred_' + str(True) + '_' + str(num_layers) | |||
| # figname = args.figure_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + '_' + str(num_layers) | |||
| print(fname_real) | |||
| print(fname_pred) | |||
| # load data | |||
| graph_real_list = load_graph_list(fname_real + '.dat') | |||
| shuffle(graph_real_list) | |||
| graph_pred_list_raw = load_graph_list(fname_pred + '.dat') | |||
| graph_real_len_list = np.array([len(graph_real_list[i]) for i in range(len(graph_real_list))]) | |||
| graph_pred_len_list_raw = np.array([len(graph_pred_list_raw[i]) for i in range(len(graph_pred_list_raw))]) | |||
| graph_pred_list = graph_pred_list_raw | |||
| graph_pred_len_list = graph_pred_len_list_raw | |||
| # # select samples | |||
| # graph_pred_list = [] | |||
| # graph_pred_len_list = [] | |||
| # for value in graph_real_len_list: | |||
| # pred_idx = find_nearest_idx(graph_pred_len_list_raw, value) | |||
| # graph_pred_list.append(graph_pred_list_raw[pred_idx]) | |||
| # graph_pred_len_list.append(graph_pred_len_list_raw[pred_idx]) | |||
| # # delete | |||
| # graph_pred_len_list_raw=np.delete(graph_pred_len_list_raw, pred_idx) | |||
| # del graph_pred_list_raw[pred_idx] | |||
| # if len(graph_pred_list)==200: | |||
| # break | |||
| # graph_pred_len_list = np.array(graph_pred_len_list) | |||
| # # select pred data within certain range | |||
| # len_min = np.amin(graph_real_len_list) | |||
| # len_max = np.amax(graph_real_len_list) | |||
| # pred_index = np.where((graph_pred_len_list>=len_min)&(graph_pred_len_list<=len_max)) | |||
| # # print(pred_index[0]) | |||
| # graph_pred_list = [graph_pred_list[i] for i in pred_index[0]] | |||
| # graph_pred_len_list = graph_pred_len_list[pred_index[0]] | |||
| # real_order = np.argsort(graph_real_len_list) | |||
| # pred_order = np.argsort(graph_pred_len_list) | |||
| real_order = np.argsort(graph_real_len_list)[::-1] | |||
| pred_order = np.argsort(graph_pred_len_list)[::-1] | |||
| # print(real_order) | |||
| # print(pred_order) | |||
| graph_real_list = [graph_real_list[i] for i in real_order] | |||
| graph_pred_list = [graph_pred_list[i] for i in pred_order] | |||
| # shuffle(graph_real_list) | |||
| # shuffle(graph_pred_list) | |||
| print('real average nodes', sum([graph_real_list[i].number_of_nodes() for i in range(len(graph_real_list))])/len(graph_real_list)) | |||
| print('pred average nodes', sum([graph_pred_list[i].number_of_nodes() for i in range(len(graph_pred_list))])/len(graph_pred_list)) | |||
| print('num of real graphs', len(graph_real_list)) | |||
| print('num of pred graphs', len(graph_pred_list)) | |||
| # # draw all graphs | |||
| # for iter in range(8): | |||
| # print('iter', iter) | |||
| # graph_list = [] | |||
| # for i in range(8): | |||
| # index = 8 * iter + i | |||
| # # graph_real_list[index].remove_nodes_from(list(nx.isolates(graph_real_list[index]))) | |||
| # # graph_pred_list[index].remove_nodes_from(list(nx.isolates(graph_pred_list[index]))) | |||
| # graph_list.append(graph_real_list[index]) | |||
| # graph_list.append(graph_pred_list[index]) | |||
| # print('real', graph_real_list[index].number_of_nodes()) | |||
| # print('pred', graph_pred_list[index].number_of_nodes()) | |||
| # | |||
| # draw_graph_list(graph_list, row=4, col=4, fname=figname + '_' + str(iter)) | |||
| # draw all graphs | |||
| for iter in range(8): | |||
| print('iter', iter) | |||
| graph_list = [] | |||
| for i in range(8): | |||
| index = 32 * iter + i | |||
| # graph_real_list[index].remove_nodes_from(list(nx.isolates(graph_real_list[index]))) | |||
| # graph_pred_list[index].remove_nodes_from(list(nx.isolates(graph_pred_list[index]))) | |||
| # graph_list.append(graph_real_list[index]) | |||
| graph_list.append(graph_pred_list[index]) | |||
| # print('real', graph_real_list[index].number_of_nodes()) | |||
| print('pred', graph_pred_list[index].number_of_nodes()) | |||
| draw_graph_list(graph_list, row=4, col=4, fname=figname + '_' + str(iter)+'_pred') | |||
| # draw all graphs | |||
| for iter in range(8): | |||
| print('iter', iter) | |||
| graph_list = [] | |||
| for i in range(8): | |||
| index = 16 * iter + i | |||
| # graph_real_list[index].remove_nodes_from(list(nx.isolates(graph_real_list[index]))) | |||
| # graph_pred_list[index].remove_nodes_from(list(nx.isolates(graph_pred_list[index]))) | |||
| graph_list.append(graph_real_list[index]) | |||
| # graph_list.append(graph_pred_list[index]) | |||
| print('real', graph_real_list[index].number_of_nodes()) | |||
| # print('pred', graph_pred_list[index].number_of_nodes()) | |||
| draw_graph_list(graph_list, row=4, col=4, fname=figname + '_' + str(iter)+'_real') | |||
| # | |||
| # # for new model | |||
| # elif args.note == 'GraphRNN_structure' and args.is_flexible==False: | |||
| # for num_layers in range(4,5): | |||
| # # give file name and figure name | |||
| # # fname_real = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # # str(epoch) + '_real_bptt_' + str(args.bptt)+'_'+str(num_layers)+'_dilation_'+str(args.is_dilation)+'_flexible_'+str(args.is_flexible)+'_bn_'+str(args.is_bn)+'_lr_'+str(args.lr) | |||
| # # fname_pred = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # # str(epoch) + '_pred_bptt_' + str(args.bptt)+'_'+str(num_layers)+'_dilation_'+str(args.is_dilation)+'_flexible_'+str(args.is_flexible)+'_bn_'+str(args.is_bn)+'_lr_'+str(args.lr) | |||
| # | |||
| # fname_pred = args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_pred_' + str(args.num_layers) + '_' + str(args.bptt)+ '_' + str(args.bptt_len) + '_' + str(args.hidden_size) | |||
| # fname_real = args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_real_' + str(args.num_layers) + '_' + str(args.bptt)+ '_' + str(args.bptt_len) + '_' + str(args.hidden_size) | |||
| # figname = args.figure_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_pred_' + str(args.num_layers) + '_' + str(args.bptt)+ '_' + str(args.bptt_len) + '_' + str(args.hidden_size) | |||
| # print(fname_real) | |||
| # # load data | |||
| # graph_real_list = load_graph_list(fname_real+'.dat') | |||
| # graph_pred_list = load_graph_list(fname_pred+'.dat') | |||
| # | |||
| # graph_real_len_list = np.array([len(graph_real_list[i]) for i in range(len(graph_real_list))]) | |||
| # graph_pred_len_list = np.array([len(graph_pred_list[i]) for i in range(len(graph_pred_list))]) | |||
| # real_order = np.argsort(graph_real_len_list)[::-1] | |||
| # pred_order = np.argsort(graph_pred_len_list)[::-1] | |||
| # # print(real_order) | |||
| # # print(pred_order) | |||
| # graph_real_list = [graph_real_list[i] for i in real_order] | |||
| # graph_pred_list = [graph_pred_list[i] for i in pred_order] | |||
| # | |||
| # shuffle(graph_pred_list) | |||
| # | |||
| # | |||
| # print('real average nodes', | |||
| # sum([graph_real_list[i].number_of_nodes() for i in range(len(graph_real_list))]) / len(graph_real_list)) | |||
| # print('pred average nodes', | |||
| # sum([graph_pred_list[i].number_of_nodes() for i in range(len(graph_pred_list))]) / len(graph_pred_list)) | |||
| # print('num of graphs', len(graph_real_list)) | |||
| # | |||
| # # draw all graphs | |||
| # for iter in range(2): | |||
| # print('iter', iter) | |||
| # graph_list = [] | |||
| # for i in range(8): | |||
| # index = 8*iter + i | |||
| # graph_real_list[index].remove_nodes_from(nx.isolates(graph_real_list[index])) | |||
| # graph_pred_list[index].remove_nodes_from(nx.isolates(graph_pred_list[index])) | |||
| # graph_list.append(graph_real_list[index]) | |||
| # graph_list.append(graph_pred_list[index]) | |||
| # print('real', graph_real_list[index].number_of_nodes()) | |||
| # print('pred', graph_pred_list[index].number_of_nodes()) | |||
| # draw_graph_list(graph_list, row=4, col=4, fname=figname+'_'+str(iter)) | |||
| # | |||
| # | |||
| # # for new model | |||
| # elif args.note == 'GraphRNN_structure' and args.is_flexible==True: | |||
| # for num_layers in range(4,5): | |||
| # graph_real_list = [] | |||
| # graph_pred_list = [] | |||
| # epoch_end = 30000 | |||
| # for epoch in [epoch_end-500*(8-i) for i in range(8)]: | |||
| # # give file name and figure name | |||
| # fname_real = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + '_real_bptt_' + str(args.bptt)+'_'+str(num_layers)+'_dilation_'+str(args.is_dilation)+'_flexible_'+str(args.is_flexible)+'_bn_'+str(args.is_bn)+'_lr_'+str(args.lr) | |||
| # fname_pred = args.graph_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + '_pred_bptt_' + str(args.bptt)+'_'+str(num_layers)+'_dilation_'+str(args.is_dilation)+'_flexible_'+str(args.is_flexible)+'_bn_'+str(args.is_bn)+'_lr_'+str(args.lr) | |||
| # | |||
| # # load data | |||
| # graph_real_list += load_graph_list(fname_real+'.dat') | |||
| # graph_pred_list += load_graph_list(fname_pred+'.dat') | |||
| # print('num of graphs', len(graph_real_list)) | |||
| # | |||
| # figname = args.figure_save_path + args.note + '_' + args.graph_type + '_' + str(args.graph_node_num) + '_' + \ | |||
| # str(epoch) + str(args.sample_when_validate) + '_' + str(num_layers) + '_dilation_' + str(args.is_dilation) + '_flexible_' + str(args.is_flexible) + '_bn_' + str(args.is_bn) + '_lr_' + str(args.lr) | |||
| # | |||
| # # draw all graphs | |||
| # for iter in range(1): | |||
| # print('iter', iter) | |||
| # graph_list = [] | |||
| # for i in range(8): | |||
| # index = 8*iter + i | |||
| # graph_real_list[index].remove_nodes_from(nx.isolates(graph_real_list[index])) | |||
| # graph_pred_list[index].remove_nodes_from(nx.isolates(graph_pred_list[index])) | |||
| # graph_list.append(graph_real_list[index]) | |||
| # graph_list.append(graph_pred_list[index]) | |||
| # draw_graph_list(graph_list, row=4, col=4, fname=figname+'_'+str(iter)) | |||
| @@ -0,0 +1,110 @@ | |||
| ### program configuration | |||
| class Args(): | |||
| def __init__(self): | |||
| ### if clean tensorboard | |||
| self.clean_tensorboard = False | |||
| ### Which CUDA GPU device is used for training | |||
| self.cuda = 1 | |||
| ### Which GraphRNN model variant is used. | |||
| # The simple version of Graph RNN | |||
| # self.note = 'GraphRNN_MLP' | |||
| # The dependent Bernoulli sequence version of GraphRNN | |||
| self.note = 'GraphRNN_RNN' | |||
## for comparison, removing the BFS component
| # self.note = 'GraphRNN_MLP_nobfs' | |||
| # self.note = 'GraphRNN_RNN_nobfs' | |||
| ### Which dataset is used to train the model | |||
| # self.graph_type = 'DD' | |||
| # self.graph_type = 'caveman' | |||
| # self.graph_type = 'caveman_small' | |||
| # self.graph_type = 'caveman_small_single' | |||
| # self.graph_type = 'community4' | |||
| self.graph_type = 'grid' | |||
| # self.graph_type = 'grid_small' | |||
| # self.graph_type = 'ladder_small' | |||
| # self.graph_type = 'enzymes' | |||
| # self.graph_type = 'enzymes_small' | |||
| # self.graph_type = 'barabasi' | |||
| # self.graph_type = 'barabasi_small' | |||
| # self.graph_type = 'citeseer' | |||
| # self.graph_type = 'citeseer_small' | |||
| # self.graph_type = 'barabasi_noise' | |||
| # self.noise = 10 | |||
| # | |||
| # if self.graph_type == 'barabasi_noise': | |||
| # self.graph_type = self.graph_type+str(self.noise) | |||
| # if none, then auto calculate | |||
| self.max_num_node = None # max number of nodes in a graph | |||
| self.max_prev_node = None # max previous node that looks back | |||
| ### network config | |||
| ## GraphRNN | |||
| if 'small' in self.graph_type: | |||
| self.parameter_shrink = 2 | |||
| else: | |||
| self.parameter_shrink = 1 | |||
| self.hidden_size_rnn = int(128/self.parameter_shrink) # hidden size for main RNN | |||
| self.hidden_size_rnn_output = 16 # hidden size for output RNN | |||
| self.embedding_size_rnn = int(64/self.parameter_shrink) # the size for LSTM input | |||
| self.embedding_size_rnn_output = 8 # the embedding size for output rnn | |||
| self.embedding_size_output = int(64/self.parameter_shrink) # the embedding size for output (VAE/MLP) | |||
| self.batch_size = 32 # normal: 32, and the rest should be changed accordingly | |||
| self.test_batch_size = 32 | |||
| self.test_total_size = 1000 | |||
| self.num_layers = 4 | |||
| ### training config | |||
| self.num_workers = 4 # num workers to load data, default 4 | |||
| self.batch_ratio = 32 # how many batches of samples per epoch, default 32, e.g., 1 epoch = 32 batches | |||
| self.epochs = 3000 # now one epoch means self.batch_ratio x batch_size | |||
| self.epochs_test_start = 100 | |||
| self.epochs_test = 100 | |||
| self.epochs_log = 100 | |||
| self.epochs_save = 100 | |||
| self.lr = 0.003 | |||
| self.milestones = [400, 1000] | |||
| self.lr_rate = 0.3 | |||
| self.sample_time = 2 # sample time in each time step, when validating | |||
| ### output config | |||
| # self.dir_input = "/dfs/scratch0/jiaxuany0/" | |||
| self.dir_input = "./" | |||
| self.model_save_path = self.dir_input+'model_save/' # only for nll evaluation | |||
| self.graph_save_path = self.dir_input+'graphs/' | |||
| self.figure_save_path = self.dir_input+'figures/' | |||
| self.timing_save_path = self.dir_input+'timing/' | |||
| self.figure_prediction_save_path = self.dir_input+'figures_prediction/' | |||
| self.nll_save_path = self.dir_input+'nll/' | |||
| self.load = False # if load model, default lr is very low | |||
| self.load_epoch = 3000 | |||
| self.save = True | |||
| ### baseline config | |||
| # self.generator_baseline = 'Gnp' | |||
| self.generator_baseline = 'BA' | |||
| # self.metric_baseline = 'general' | |||
| # self.metric_baseline = 'degree' | |||
| self.metric_baseline = 'clustering' | |||
### filenames to save intermediate and final outputs
| self.fname = self.note + '_' + self.graph_type + '_' + str(self.num_layers) + '_' + str(self.hidden_size_rnn) + '_' | |||
| self.fname_pred = self.note+'_'+self.graph_type+'_'+str(self.num_layers)+'_'+ str(self.hidden_size_rnn)+'_pred_' | |||
| self.fname_train = self.note+'_'+self.graph_type+'_'+str(self.num_layers)+'_'+ str(self.hidden_size_rnn)+'_train_' | |||
| self.fname_test = self.note + '_' + self.graph_type + '_' + str(self.num_layers) + '_' + str(self.hidden_size_rnn) + '_test_' | |||
| self.fname_baseline = self.graph_save_path + self.graph_type + self.generator_baseline+'_'+self.metric_baseline | |||
| @@ -0,0 +1,275 @@ | |||
| from main import * | |||
| from scipy.linalg import toeplitz | |||
| import pyemd | |||
| import scipy.optimize as opt | |||
| def Graph_generator_baseline_train_rulebased(graphs,generator='BA'): | |||
| graph_nodes = [graphs[i].number_of_nodes() for i in range(len(graphs))] | |||
| graph_edges = [graphs[i].number_of_edges() for i in range(len(graphs))] | |||
| parameter = {} | |||
| for i in range(len(graph_nodes)): | |||
| nodes = graph_nodes[i] | |||
| edges = graph_edges[i] | |||
| # based on rule, calculate optimal parameter | |||
| if generator=='BA': | |||
| # BA optimal: nodes = n; edges = (n-m)*m | |||
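# (solving edges = (n-m)*m for m gives m^2 - n*m + edges = 0,
#  so m = (n - sqrt(n^2 - 4*edges)) / 2, taking the smaller root)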
| n = nodes | |||
| m = (n - np.sqrt(n**2-4*edges))/2 | |||
| parameter_temp = [n,m,1] | |||
| if generator=='Gnp': | |||
| # Gnp optimal: nodes = n; edges = ((n-1)*n/2)*p | |||
| n = nodes | |||
| p = float(edges)/((n-1)*n/2) | |||
| parameter_temp = [n,p,1] | |||
| # update parameter list | |||
| if nodes not in parameter.keys(): | |||
| parameter[nodes] = parameter_temp | |||
| else: | |||
| count = parameter[nodes][-1] | |||
| parameter[nodes] = [(parameter[nodes][i]*count+parameter_temp[i])/(count+1) for i in range(len(parameter[nodes]))] | |||
| parameter[nodes][-1] = count+1 | |||
| # print(parameter) | |||
| return parameter | |||
| def Graph_generator_baseline(graph_train, pred_num=1000, generator='BA'): | |||
| graph_nodes = [graph_train[i].number_of_nodes() for i in range(len(graph_train))] | |||
| graph_edges = [graph_train[i].number_of_edges() for i in range(len(graph_train))] | |||
| repeat = pred_num//len(graph_train) | |||
| graph_pred = [] | |||
| for i in range(len(graph_nodes)): | |||
| nodes = graph_nodes[i] | |||
| edges = graph_edges[i] | |||
| # based on rule, calculate optimal parameter | |||
| if generator=='BA': | |||
| # BA optimal: nodes = n; edges = (n-m)*m | |||
| n = nodes | |||
| m = int((n - np.sqrt(n**2-4*edges))/2) | |||
| for j in range(repeat): | |||
| graph_pred.append(nx.barabasi_albert_graph(n,m)) | |||
| if generator=='Gnp': | |||
| # Gnp optimal: nodes = n; edges = ((n-1)*n/2)*p | |||
| n = nodes | |||
| p = float(edges)/((n-1)*n/2) | |||
| for j in range(repeat): | |||
| graph_pred.append(nx.fast_gnp_random_graph(n, p)) | |||
| return graph_pred | |||
| def emd_distance(x, y, distance_scaling=1.0): | |||
| support_size = max(len(x), len(y)) | |||
| d_mat = toeplitz(range(support_size)).astype(np.float) | |||
| distance_mat = d_mat / distance_scaling | |||
| # convert histogram values x and y to float, and make them equal len | |||
| x = x.astype(np.float) | |||
| y = y.astype(np.float) | |||
| if len(x) < len(y): | |||
| x = np.hstack((x, [0.0] * (support_size - len(x)))) | |||
| elif len(y) < len(x): | |||
| y = np.hstack((y, [0.0] * (support_size - len(y)))) | |||
| emd = pyemd.emd(x, y, distance_mat) | |||
| return emd | |||
| # def Loss(x,args): | |||
| # ''' | |||
| # | |||
| # :param x: 1-D array, parameters to be optimized | |||
| # :param args: tuple (n, G, generator, metric). | |||
| # n: n for pred graph; | |||
| # G: real graph in networkx format; | |||
| # generator: 'BA', 'Gnp', 'Powerlaw'; | |||
| # metric: 'degree', 'clustering' | |||
| # :return: Loss: emd distance | |||
| # ''' | |||
| # # get argument | |||
| # generator = args[2] | |||
| # metric = args[3] | |||
| # | |||
| # # get real and pred graphs | |||
| # G_real = args[1] | |||
| # if generator=='BA': | |||
| # G_pred = nx.barabasi_albert_graph(args[0],int(np.rint(x))) | |||
| # if generator=='Gnp': | |||
| # G_pred = nx.fast_gnp_random_graph(args[0],x) | |||
| # | |||
| # # define metric | |||
| # if metric == 'degree': | |||
| # G_real_hist = np.array(nx.degree_histogram(G_real)) | |||
| # G_real_hist = G_real_hist / np.sum(G_real_hist) | |||
| # G_pred_hist = np.array(nx.degree_histogram(G_pred)) | |||
| # G_pred_hist = G_pred_hist/np.sum(G_pred_hist) | |||
| # if metric == 'clustering': | |||
| # G_real_hist, _ = np.histogram( | |||
| # np.array(list(nx.clustering(G_real).values())), bins=50, range=(0.0, 1.0), density=False) | |||
| # G_real_hist = G_real_hist / np.sum(G_real_hist) | |||
| # G_pred_hist, _ = np.histogram( | |||
| # np.array(list(nx.clustering(G_pred).values())), bins=50, range=(0.0, 1.0), density=False) | |||
| # G_pred_hist = G_pred_hist / np.sum(G_pred_hist) | |||
| # | |||
| # loss = emd_distance(G_real_hist,G_pred_hist) | |||
| # return loss | |||
| def Loss(x,n,G_real,generator,metric): | |||
| ''' | |||
| :param x: 1-D array, parameters to be optimized | |||
| :param | |||
| n: n for pred graph; | |||
| G: real graph in networkx format; | |||
| generator: 'BA', 'Gnp', 'Powerlaw'; | |||
| metric: 'degree', 'clustering' | |||
| :return: Loss: emd distance | |||
| ''' | |||
| # get argument | |||
| # get real and pred graphs | |||
| if generator=='BA': | |||
| G_pred = nx.barabasi_albert_graph(n,int(np.rint(x))) | |||
| if generator=='Gnp': | |||
| G_pred = nx.fast_gnp_random_graph(n,x) | |||
| # define metric | |||
| if metric == 'degree': | |||
| G_real_hist = np.array(nx.degree_histogram(G_real)) | |||
| G_real_hist = G_real_hist / np.sum(G_real_hist) | |||
| G_pred_hist = np.array(nx.degree_histogram(G_pred)) | |||
| G_pred_hist = G_pred_hist/np.sum(G_pred_hist) | |||
| if metric == 'clustering': | |||
| G_real_hist, _ = np.histogram( | |||
| np.array(list(nx.clustering(G_real).values())), bins=50, range=(0.0, 1.0), density=False) | |||
| G_real_hist = G_real_hist / np.sum(G_real_hist) | |||
| G_pred_hist, _ = np.histogram( | |||
| np.array(list(nx.clustering(G_pred).values())), bins=50, range=(0.0, 1.0), density=False) | |||
| G_pred_hist = G_pred_hist / np.sum(G_pred_hist) | |||
| loss = emd_distance(G_real_hist,G_pred_hist) | |||
| return loss | |||
| def optimizer_brute(x_min, x_max, x_step, n, G_real, generator, metric): | |||
| loss_all = [] | |||
| x_list = np.arange(x_min,x_max,x_step) | |||
| for x_test in x_list: | |||
| loss_all.append(Loss(x_test,n,G_real,generator,metric)) | |||
| x_optim = x_list[np.argmin(np.array(loss_all))] | |||
| return x_optim | |||
| def Graph_generator_baseline_train_optimizationbased(graphs,generator='BA',metric='degree'): | |||
| graph_nodes = [graphs[i].number_of_nodes() for i in range(len(graphs))] | |||
| parameter = {} | |||
| for i in range(len(graph_nodes)): | |||
| print('graph ',i) | |||
| nodes = graph_nodes[i] | |||
| if generator=='BA': | |||
| n = nodes | |||
| m = optimizer_brute(1,10,1, nodes, graphs[i], generator, metric) | |||
| parameter_temp = [n,m,1] | |||
| elif generator=='Gnp': | |||
| n = nodes | |||
| p = optimizer_brute(1e-6,1,0.01, nodes, graphs[i], generator, metric) | |||
| ## if use evolution | |||
| # result = opt.differential_evolution(Loss,bounds=[(0,1)],args=(nodes, graphs[i], generator, metric),maxiter=1000) | |||
| # p = result.x | |||
| parameter_temp = [n, p, 1] | |||
| # update parameter list | |||
| if nodes not in parameter.keys(): | |||
| parameter[nodes] = parameter_temp | |||
| else: | |||
| count = parameter[nodes][2] | |||
| parameter[nodes] = [(parameter[nodes][i]*count+parameter_temp[i])/(count+1) for i in range(len(parameter[nodes]))] | |||
| parameter[nodes][2] = count+1 | |||
| print(parameter) | |||
| return parameter | |||
| def Graph_generator_baseline_test(graph_nodes, parameter, generator='BA'): | |||
| graphs = [] | |||
| for i in range(len(graph_nodes)): | |||
| nodes = graph_nodes[i] | |||
| if not nodes in parameter.keys(): | |||
| nodes = min(parameter.keys(), key=lambda k: abs(k - nodes)) | |||
| if generator=='BA': | |||
| n = int(parameter[nodes][0]) | |||
| m = int(np.rint(parameter[nodes][1])) | |||
| print(n,m) | |||
| graph = nx.barabasi_albert_graph(n,m) | |||
| if generator=='Gnp': | |||
| n = int(parameter[nodes][0]) | |||
| p = parameter[nodes][1] | |||
| print(n,p) | |||
| graph = nx.fast_gnp_random_graph(n,p) | |||
| graphs.append(graph) | |||
| return graphs | |||
| if __name__ == '__main__': | |||
| args = Args() | |||
| print('File name prefix', args.fname) | |||
| ### load datasets | |||
| graphs = [] | |||
| # synthetic graphs | |||
| if args.graph_type=='ladder': | |||
| graphs = [] | |||
| for i in range(100, 201): | |||
| graphs.append(nx.ladder_graph(i)) | |||
| args.max_prev_node = 10 | |||
| if args.graph_type=='tree': | |||
| graphs = [] | |||
| for i in range(2,5): | |||
| for j in range(3,5): | |||
| graphs.append(nx.balanced_tree(i,j)) | |||
| args.max_prev_node = 256 | |||
| if args.graph_type=='caveman': | |||
| graphs = [] | |||
| for i in range(5,10): | |||
| for j in range(5,25): | |||
| graphs.append(nx.connected_caveman_graph(i, j)) | |||
| args.max_prev_node = 50 | |||
| if args.graph_type=='grid': | |||
| graphs = [] | |||
| for i in range(10,20): | |||
| for j in range(10,20): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| args.max_prev_node = 40 | |||
| if args.graph_type=='barabasi': | |||
| graphs = [] | |||
| for i in range(100,200): | |||
| graphs.append(nx.barabasi_albert_graph(i,2)) | |||
| args.max_prev_node = 130 | |||
| # real graphs | |||
| if args.graph_type == 'enzymes': | |||
| graphs= Graph_load_batch(min_num_nodes=10, name='ENZYMES') | |||
| args.max_prev_node = 25 | |||
| if args.graph_type == 'protein': | |||
| graphs = Graph_load_batch(min_num_nodes=20, name='PROTEINS_full') | |||
| args.max_prev_node = 80 | |||
| if args.graph_type == 'DD': | |||
| graphs = Graph_load_batch(min_num_nodes=100, max_num_nodes=500, name='DD',node_attributes=False,graph_labels=True) | |||
| args.max_prev_node = 230 | |||
| graph_nodes = [graphs[i].number_of_nodes() for i in range(len(graphs))] | |||
| graph_edges = [graphs[i].number_of_edges() for i in range(len(graphs))] | |||
| args.max_num_node = max(graph_nodes) | |||
| # show graphs statistics | |||
| print('total graph num: {}'.format(len(graphs))) | |||
| print('max number node: {}'.format(args.max_num_node)) | |||
| print('max previous node: {}'.format(args.max_prev_node)) | |||
| # start baseline generation method | |||
| generator = args.generator_baseline | |||
| metric = args.metric_baseline | |||
| print(args.fname_baseline + '.dat') | |||
| if metric=='general': | |||
| parameter = Graph_generator_baseline_train_rulebased(graphs,generator=generator) | |||
| else: | |||
| parameter = Graph_generator_baseline_train_optimizationbased(graphs,generator=generator,metric=metric) | |||
| graphs_generated = Graph_generator_baseline_test(graph_nodes, parameter,generator) | |||
| save_graph_list(graphs_generated,args.fname_baseline + '.dat') | |||
| @@ -0,0 +1,58 @@ | |||
| import networkx as nx | |||
| import numpy as np | |||
| import torch | |||
| class GraphAdjSampler(torch.utils.data.Dataset): | |||
| def __init__(self, G_list, max_num_nodes, features='id'): | |||
| self.max_num_nodes = max_num_nodes | |||
| self.adj_all = [] | |||
| self.len_all = [] | |||
| self.feature_all = [] | |||
| for G in G_list: | |||
| adj = nx.to_numpy_matrix(G) | |||
| # the diagonal entries are 1 since they denote node probability | |||
| self.adj_all.append( | |||
| np.asarray(adj) + np.identity(G.number_of_nodes())) | |||
| self.len_all.append(G.number_of_nodes()) | |||
| if features == 'id': | |||
| self.feature_all.append(np.identity(max_num_nodes)) | |||
| elif features == 'deg': | |||
| degs = np.sum(np.array(adj), 1) | |||
# pad the degree vector to max_num_nodes; np.pad expects a mode string, not 0
degs = np.expand_dims(np.pad(degs, [0, max_num_nodes - G.number_of_nodes()], 'constant'),
axis=1)
| self.feature_all.append(degs) | |||
| elif features == 'struct': | |||
| degs = np.sum(np.array(adj), 1) | |||
| degs = np.expand_dims(np.pad(degs, [0, max_num_nodes - G.number_of_nodes()], | |||
| 'constant'), | |||
| axis=1) | |||
| clusterings = np.array(list(nx.clustering(G).values())) | |||
| clusterings = np.expand_dims(np.pad(clusterings, | |||
| [0, max_num_nodes - G.number_of_nodes()], | |||
| 'constant'), | |||
| axis=1) | |||
| self.feature_all.append(np.hstack([degs, clusterings])) | |||
| def __len__(self): | |||
| return len(self.adj_all) | |||
| def __getitem__(self, idx): | |||
| adj = self.adj_all[idx] | |||
| num_nodes = adj.shape[0] | |||
| adj_padded = np.zeros((self.max_num_nodes, self.max_num_nodes)) | |||
| adj_padded[:num_nodes, :num_nodes] = adj | |||
| adj_decoded = np.zeros(self.max_num_nodes * (self.max_num_nodes + 1) // 2) | |||
| node_idx = 0 | |||
| adj_vectorized = adj_padded[np.triu(np.ones((self.max_num_nodes,self.max_num_nodes)) ) == 1] | |||
| # the following 2 lines recover the upper triangle of the adj matrix | |||
| #recovered = np.zeros((self.max_num_nodes, self.max_num_nodes)) | |||
| #recovered[np.triu(np.ones((self.max_num_nodes, self.max_num_nodes)) ) == 1] = adj_vectorized | |||
| #print(recovered) | |||
| return {'adj':adj_padded, | |||
| 'adj_decoded':adj_vectorized, | |||
| 'features':self.feature_all[idx].copy()} | |||
| @@ -0,0 +1,208 @@ | |||
| import numpy as np | |||
| import scipy.optimize | |||
| import torch | |||
| import torch.nn as nn | |||
| from torch.autograd import Variable | |||
| from torch import optim | |||
| import torch.nn.functional as F | |||
| import torch.nn.init as init | |||
| import model | |||
| class GraphVAE(nn.Module): | |||
| def __init__(self, input_dim, hidden_dim, latent_dim, max_num_nodes, pool='sum'): | |||
| ''' | |||
| Args: | |||
| input_dim: input feature dimension for node. | |||
| hidden_dim: hidden dim for 2-layer gcn. | |||
| latent_dim: dimension of the latent representation of graph. | |||
| ''' | |||
| super(GraphVAE, self).__init__() | |||
| self.conv1 = model.GraphConv(input_dim=input_dim, output_dim=hidden_dim) | |||
| self.bn1 = nn.BatchNorm1d(input_dim) | |||
| self.conv2 = model.GraphConv(input_dim=hidden_dim, output_dim=hidden_dim) | |||
| self.bn2 = nn.BatchNorm1d(input_dim) | |||
| self.act = nn.ReLU() | |||
| output_dim = max_num_nodes * (max_num_nodes + 1) // 2 | |||
| #self.vae = model.MLP_VAE_plain(hidden_dim, latent_dim, output_dim) | |||
| self.vae = model.MLP_VAE_plain(input_dim * input_dim, latent_dim, output_dim) | |||
| #self.feature_mlp = model.MLP_plain(latent_dim, latent_dim, output_dim) | |||
| self.max_num_nodes = max_num_nodes | |||
| for m in self.modules(): | |||
| if isinstance(m, model.GraphConv): | |||
| m.weight.data = init.xavier_uniform(m.weight.data, gain=nn.init.calculate_gain('relu')) | |||
| elif isinstance(m, nn.BatchNorm1d): | |||
| m.weight.data.fill_(1) | |||
| m.bias.data.zero_() | |||
| self.pool = pool | |||
| def recover_adj_lower(self, l): | |||
| # NOTE: Assumes 1 per minibatch | |||
| adj = torch.zeros(self.max_num_nodes, self.max_num_nodes) | |||
| adj[torch.triu(torch.ones(self.max_num_nodes, self.max_num_nodes)) == 1] = l | |||
| return adj | |||
| def recover_full_adj_from_lower(self, lower): | |||
| diag = torch.diag(torch.diag(lower, 0)) | |||
| return lower + torch.transpose(lower, 0, 1) - diag | |||
| def edge_similarity_matrix(self, adj, adj_recon, matching_features, | |||
| matching_features_recon, sim_func): | |||
| S = torch.zeros(self.max_num_nodes, self.max_num_nodes, | |||
| self.max_num_nodes, self.max_num_nodes) | |||
| for i in range(self.max_num_nodes): | |||
| for j in range(self.max_num_nodes): | |||
| if i == j: | |||
| for a in range(self.max_num_nodes): | |||
| S[i, i, a, a] = adj[i, i] * adj_recon[a, a] * \ | |||
| sim_func(matching_features[i], matching_features_recon[a]) | |||
| # with feature not implemented | |||
| # if input_features is not None: | |||
| else: | |||
| for a in range(self.max_num_nodes): | |||
| for b in range(self.max_num_nodes): | |||
| if b == a: | |||
| continue | |||
| S[i, j, a, b] = adj[i, j] * adj[i, i] * adj[j, j] * \ | |||
| adj_recon[a, b] * adj_recon[a, a] * adj_recon[b, b] | |||
| return S | |||
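# mpm: max-pooling matching. Starting from a uniform soft assignment x,
# each iteration sets x_new[i, a] = x[i, a] * S[i, i, a, a] plus the sum over
# all other nodes j of max_b x[j, b] * S[i, j, a, b], then renormalizes x.
# This approximates the graph-matching step GraphVAE uses to align the
# predicted and ground-truth adjacency matrices.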
| def mpm(self, x_init, S, max_iters=50): | |||
| x = x_init | |||
| for it in range(max_iters): | |||
| x_new = torch.zeros(self.max_num_nodes, self.max_num_nodes) | |||
| for i in range(self.max_num_nodes): | |||
| for a in range(self.max_num_nodes): | |||
| x_new[i, a] = x[i, a] * S[i, i, a, a] | |||
| pooled = [torch.max(x[j, :] * S[i, j, a, :]) | |||
| for j in range(self.max_num_nodes) if j != i] | |||
| neigh_sim = sum(pooled) | |||
| x_new[i, a] += neigh_sim | |||
| norm = torch.norm(x_new) | |||
| x = x_new / norm | |||
| return x | |||
| def deg_feature_similarity(self, f1, f2): | |||
| return 1 / (abs(f1 - f2) + 1) | |||
| def permute_adj(self, adj, curr_ind, target_ind): | |||
| ''' Permute adjacency matrix. | |||
| The target_ind (connectivity) should be permuted to the curr_ind position. | |||
| ''' | |||
| # order curr_ind according to target ind | |||
| ind = np.zeros(self.max_num_nodes, dtype=np.int) | |||
| ind[target_ind] = curr_ind | |||
| adj_permuted = torch.zeros((self.max_num_nodes, self.max_num_nodes)) | |||
| adj_permuted[:, :] = adj[ind, :] | |||
| adj_permuted[:, :] = adj_permuted[:, ind] | |||
| return adj_permuted | |||
| def pool_graph(self, x): | |||
| if self.pool == 'max': | |||
| out, _ = torch.max(x, dim=1, keepdim=False) | |||
| elif self.pool == 'sum': | |||
| out = torch.sum(x, dim=1, keepdim=False) | |||
| return out | |||
| def forward(self, input_features, adj): | |||
| #x = self.conv1(input_features, adj) | |||
| #x = self.bn1(x) | |||
| #x = self.act(x) | |||
| #x = self.conv2(x, adj) | |||
| #x = self.bn2(x) | |||
| # pool over all nodes | |||
| #graph_h = self.pool_graph(x) | |||
| graph_h = input_features.view(-1, self.max_num_nodes * self.max_num_nodes) | |||
| # vae | |||
| h_decode, z_mu, z_lsgms = self.vae(graph_h) | |||
| out = F.sigmoid(h_decode) | |||
| out_tensor = out.cpu().data | |||
| recon_adj_lower = self.recover_adj_lower(out_tensor) | |||
| recon_adj_tensor = self.recover_full_adj_from_lower(recon_adj_lower) | |||
| # set matching features be degree | |||
| out_features = torch.sum(recon_adj_tensor, 1) | |||
| adj_data = adj.cpu().data[0] | |||
| adj_features = torch.sum(adj_data, 1) | |||
| S = self.edge_similarity_matrix(adj_data, recon_adj_tensor, adj_features, out_features, | |||
| self.deg_feature_similarity) | |||
| # initialization strategies | |||
| init_corr = 1 / self.max_num_nodes | |||
| init_assignment = torch.ones(self.max_num_nodes, self.max_num_nodes) * init_corr | |||
| #init_assignment = torch.FloatTensor(4, 4) | |||
| #init.uniform(init_assignment) | |||
| assignment = self.mpm(init_assignment, S) | |||
| #print('Assignment: ', assignment) | |||
| # matching | |||
| # use negative of the assignment score since the alg finds min cost flow | |||
| row_ind, col_ind = scipy.optimize.linear_sum_assignment(-assignment.numpy()) | |||
| print('row: ', row_ind) | |||
| print('col: ', col_ind) | |||
| # order row index according to col index | |||
| #adj_permuted = self.permute_adj(adj_data, row_ind, col_ind) | |||
| adj_permuted = adj_data | |||
| adj_vectorized = adj_permuted[torch.triu(torch.ones(self.max_num_nodes,self.max_num_nodes) )== 1].squeeze_() | |||
| adj_vectorized_var = Variable(adj_vectorized).cuda() | |||
| #print(adj) | |||
| #print('permuted: ', adj_permuted) | |||
| #print('recon: ', recon_adj_tensor) | |||
| adj_recon_loss = self.adj_recon_loss(adj_vectorized_var, out[0]) | |||
| print('recon: ', adj_recon_loss) | |||
| print(adj_vectorized_var) | |||
| print(out[0]) | |||
| loss_kl = -0.5 * torch.sum(1 + z_lsgms - z_mu.pow(2) - z_lsgms.exp()) | |||
| loss_kl /= self.max_num_nodes * self.max_num_nodes # normalize | |||
| print('kl: ', loss_kl) | |||
| loss = adj_recon_loss + loss_kl | |||
| return loss | |||
| def forward_test(self, input_features, adj): | |||
| self.max_num_nodes = 4 | |||
| adj_data = torch.zeros(self.max_num_nodes, self.max_num_nodes) | |||
| adj_data[:4, :4] = torch.FloatTensor([[1,1,0,0], [1,1,1,0], [0,1,1,1], [0,0,1,1]]) | |||
| adj_features = torch.Tensor([2,3,3,2]) | |||
| adj_data1 = torch.zeros(self.max_num_nodes, self.max_num_nodes) | |||
| adj_data1 = torch.FloatTensor([[1,1,1,0], [1,1,0,1], [1,0,1,0], [0,1,0,1]]) | |||
| adj_features1 = torch.Tensor([3,3,2,2]) | |||
| S = self.edge_similarity_matrix(adj_data, adj_data1, adj_features, adj_features1, | |||
| self.deg_feature_similarity) | |||
| # initialization strategies | |||
| init_corr = 1 / self.max_num_nodes | |||
| init_assignment = torch.ones(self.max_num_nodes, self.max_num_nodes) * init_corr | |||
| #init_assignment = torch.FloatTensor(4, 4) | |||
| #init.uniform(init_assignment) | |||
| assignment = self.mpm(init_assignment, S) | |||
| #print('Assignment: ', assignment) | |||
| # matching | |||
| row_ind, col_ind = scipy.optimize.linear_sum_assignment(-assignment.numpy()) | |||
| print('row: ', row_ind) | |||
| print('col: ', col_ind) | |||
| permuted_adj = self.permute_adj(adj_data, row_ind, col_ind) | |||
| print('permuted: ', permuted_adj) | |||
| adj_recon_loss = self.adj_recon_loss(permuted_adj, adj_data1) | |||
| print(adj_data1) | |||
| print('diff: ', adj_recon_loss) | |||
| def adj_recon_loss(self, adj_truth, adj_pred): | |||
# F.binary_cross_entropy expects (input=predicted probabilities, target=labels)
return F.binary_cross_entropy(adj_pred, adj_truth)
| @@ -0,0 +1,132 @@ | |||
| import argparse | |||
| import matplotlib.pyplot as plt | |||
| import networkx as nx | |||
| import numpy as np | |||
| import os | |||
| from random import shuffle | |||
| import torch | |||
| import torch.nn as nn | |||
| import torch.nn.init as init | |||
| from torch.autograd import Variable | |||
| import torch.nn.functional as F | |||
| from torch import optim | |||
| from torch.optim.lr_scheduler import MultiStepLR | |||
| import data | |||
| from baselines.graphvae.model import GraphVAE | |||
| from baselines.graphvae.data import GraphAdjSampler | |||
| CUDA = 2 | |||
| LR_milestones = [500, 1000] | |||
| def build_model(args, max_num_nodes): | |||
| out_dim = max_num_nodes * (max_num_nodes + 1) // 2 | |||
| if args.feature_type == 'id': | |||
| input_dim = max_num_nodes | |||
| elif args.feature_type == 'deg': | |||
| input_dim = 1 | |||
| elif args.feature_type == 'struct': | |||
| input_dim = 2 | |||
| model = GraphVAE(input_dim, 64, 256, max_num_nodes) | |||
| return model | |||
| def train(args, dataloader, model): | |||
| epoch = 1 | |||
| optimizer = optim.Adam(list(model.parameters()), lr=args.lr) | |||
| scheduler = MultiStepLR(optimizer, milestones=LR_milestones, gamma=args.lr) | |||
| model.train() | |||
| for epoch in range(5000): | |||
| for batch_idx, data in enumerate(dataloader): | |||
| model.zero_grad() | |||
| features = data['features'].float() | |||
| adj_input = data['adj'].float() | |||
| features = Variable(features).cuda() | |||
| adj_input = Variable(adj_input).cuda() | |||
| loss = model(features, adj_input) | |||
| print('Epoch: ', epoch, ', Iter: ', batch_idx, ', Loss: ', loss) | |||
| loss.backward() | |||
| optimizer.step() | |||
| scheduler.step() | |||
| break | |||
| def arg_parse(): | |||
| parser = argparse.ArgumentParser(description='GraphVAE arguments.') | |||
| io_parser = parser.add_mutually_exclusive_group(required=False) | |||
| io_parser.add_argument('--dataset', dest='dataset', | |||
| help='Input dataset.') | |||
| parser.add_argument('--lr', dest='lr', type=float, | |||
| help='Learning rate.') | |||
| parser.add_argument('--batch_size', dest='batch_size', type=int, | |||
| help='Batch size.') | |||
| parser.add_argument('--num_workers', dest='num_workers', type=int, | |||
| help='Number of workers to load data.') | |||
| parser.add_argument('--max_num_nodes', dest='max_num_nodes', type=int, | |||
| help='Predefined maximum number of nodes in train/test graphs. -1 if determined by \ | |||
| training data.') | |||
| parser.add_argument('--feature', dest='feature_type', | |||
| help='Feature used for encoder. Can be: id, deg') | |||
| parser.set_defaults(dataset='grid', | |||
| feature_type='id', | |||
| lr=0.001, | |||
| batch_size=1, | |||
| num_workers=1, | |||
| max_num_nodes=-1) | |||
| return parser.parse_args() | |||
| def main(): | |||
| prog_args = arg_parse() | |||
| os.environ['CUDA_VISIBLE_DEVICES'] = str(CUDA) | |||
| print('CUDA', CUDA) | |||
| ### running log | |||
| if prog_args.dataset == 'enzymes': | |||
| graphs= data.Graph_load_batch(min_num_nodes=10, name='ENZYMES') | |||
| num_graphs_raw = len(graphs) | |||
| elif prog_args.dataset == 'grid': | |||
| graphs = [] | |||
| for i in range(2,3): | |||
| for j in range(2,3): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| num_graphs_raw = len(graphs) | |||
| if prog_args.max_num_nodes == -1: | |||
| max_num_nodes = max([graphs[i].number_of_nodes() for i in range(len(graphs))]) | |||
| else: | |||
| max_num_nodes = prog_args.max_num_nodes | |||
| # remove graphs with number of nodes greater than max_num_nodes | |||
| graphs = [g for g in graphs if g.number_of_nodes() <= max_num_nodes] | |||
| graphs_len = len(graphs) | |||
| print('Number of graphs removed due to upper-limit of number of nodes: ', | |||
| num_graphs_raw - graphs_len) | |||
| graphs_test = graphs[int(0.8 * graphs_len):] | |||
| #graphs_train = graphs[0:int(0.8*graphs_len)] | |||
| graphs_train = graphs | |||
| print('total graph num: {}, training set: {}'.format(len(graphs),len(graphs_train))) | |||
| print('max number node: {}'.format(max_num_nodes)) | |||
| dataset = GraphAdjSampler(graphs_train, max_num_nodes, features=prog_args.feature_type) | |||
| #sample_strategy = torch.utils.data.sampler.WeightedRandomSampler( | |||
| # [1.0 / len(dataset) for i in range(len(dataset))], | |||
| # num_samples=prog_args.batch_size, | |||
| # replacement=False) | |||
| dataset_loader = torch.utils.data.DataLoader( | |||
| dataset, | |||
| batch_size=prog_args.batch_size, | |||
| num_workers=prog_args.num_workers) | |||
| model = build_model(prog_args, max_num_nodes).cuda() | |||
| train(prog_args, dataset_loader, model) | |||
| if __name__ == '__main__': | |||
| main() | |||
| @@ -0,0 +1,154 @@ | |||
| """Stochastic block model.""" | |||
| import argparse | |||
| import os | |||
| from time import time | |||
| import edward as ed | |||
| import networkx as nx | |||
| import numpy as np | |||
| import tensorflow as tf | |||
| from edward.models import Bernoulli, Multinomial, Beta, Dirichlet, PointMass, Normal | |||
| from observations import karate | |||
| from sklearn.metrics.cluster import adjusted_rand_score | |||
| import utils | |||
| CUDA = 2 | |||
| ed.set_seed(int(time())) | |||
| #ed.set_seed(42) | |||
| # DATA | |||
| #X_data, Z_true = karate("data") | |||
| def disjoint_cliques_test_graph(num_cliques, clique_size): | |||
| G = nx.disjoint_union_all([nx.complete_graph(clique_size) for _ in range(num_cliques)]) | |||
| return nx.to_numpy_matrix(G) | |||
| def mmsb(N, K, data): | |||
| # sparsity | |||
| rho = 0.3 | |||
| # MODEL | |||
| # probability of belonging to each of K blocks for each node | |||
| gamma = Dirichlet(concentration=tf.ones([K])) | |||
| # block connectivity | |||
| Pi = Beta(concentration0=tf.ones([K, K]), concentration1=tf.ones([K, K])) | |||
| # probability of belonging to each of K blocks for all nodes | |||
| Z = Multinomial(total_count=1.0, probs=gamma, sample_shape=N) | |||
| # adjacency | |||
| X = Bernoulli(probs=(1 - rho) * tf.matmul(Z, tf.matmul(Pi, tf.transpose(Z)))) | |||
| # INFERENCE (EM algorithm) | |||
| qgamma = PointMass(params=tf.nn.softmax(tf.Variable(tf.random_normal([K])))) | |||
| qPi = PointMass(params=tf.nn.sigmoid(tf.Variable(tf.random_normal([K, K])))) | |||
| qZ = PointMass(params=tf.nn.softmax(tf.Variable(tf.random_normal([N, K])))) | |||
| #qgamma = Normal(loc=tf.get_variable("qgamma/loc", [K]), | |||
| # scale=tf.nn.softplus( | |||
| # tf.get_variable("qgamma/scale", [K]))) | |||
| #qPi = Normal(loc=tf.get_variable("qPi/loc", [K, K]), | |||
| # scale=tf.nn.softplus( | |||
| # tf.get_variable("qPi/scale", [K, K]))) | |||
| #qZ = Normal(loc=tf.get_variable("qZ/loc", [N, K]), | |||
| # scale=tf.nn.softplus( | |||
| # tf.get_variable("qZ/scale", [N, K]))) | |||
| #inference = ed.KLqp({gamma: qgamma, Pi: qPi, Z: qZ}, data={X: data}) | |||
| inference = ed.MAP({gamma: qgamma, Pi: qPi, Z: qZ}, data={X: data}) | |||
| #inference.run() | |||
| n_iter = 6000 | |||
| inference.initialize(optimizer=tf.train.AdamOptimizer(learning_rate=0.01), n_iter=n_iter) | |||
| tf.global_variables_initializer().run() | |||
| for _ in range(inference.n_iter): | |||
| info_dict = inference.update() | |||
| inference.print_progress(info_dict) | |||
| inference.finalize() | |||
| print('qgamma after: ', qgamma.mean().eval()) | |||
| return qZ.mean().eval(), qPi.eval() | |||
| def arg_parse(): | |||
| parser = argparse.ArgumentParser(description='MMSB arguments.') | |||
| parser.add_argument('--dataset', dest='dataset', | |||
| help='Input dataset.') | |||
| parser.add_argument('--K', dest='K', type=int, | |||
| help='Number of blocks.') | |||
| parser.add_argument('--samples-per-G', dest='samples', type=int, | |||
| help='Number of samples for every graph.') | |||
| parser.set_defaults(dataset='community', | |||
| K=4, | |||
| samples=1) | |||
| return parser.parse_args() | |||
| def graph_gen_from_blockmodel(B, Z): | |||
| n_blocks = len(B) | |||
| B = np.array(B) | |||
| Z = np.array(Z) | |||
| adj_prob = np.dot(Z, np.dot(B, np.transpose(Z))) | |||
| adj = np.random.binomial(1, adj_prob * 0.3) | |||
| return nx.from_numpy_matrix(adj) | |||
| if __name__ == '__main__': | |||
| prog_args = arg_parse() | |||
| os.environ['CUDA_VISIBLE_DEVICES'] = str(CUDA) | |||
| print('CUDA', CUDA) | |||
| X_dataset = [] | |||
| #X_data = nx.to_numpy_matrix(nx.connected_caveman_graph(4, 7)) | |||
| if prog_args.dataset == 'clique_test': | |||
| X_data = disjoint_cliques_test_graph(4, 7) | |||
| X_dataset.append(X_data) | |||
| elif prog_args.dataset == 'citeseer': | |||
| graphs = utils.citeseer_ego() | |||
| X_dataset = [nx.to_numpy_matrix(g) for g in graphs] | |||
| elif prog_args.dataset == 'community': | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(30, 81): | |||
| for k in range(10): | |||
| graphs.append(utils.caveman_special(i,j, p_edge=0.3)) | |||
| X_dataset = [nx.to_numpy_matrix(g) for g in graphs] | |||
| elif prog_args.dataset == 'grid': | |||
| graphs = [] | |||
| for i in range(10,20): | |||
| for j in range(10,20): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| X_dataset = [nx.to_numpy_matrix(g) for g in graphs] | |||
| elif prog_args.dataset.startswith('community'): | |||
| graphs = [] | |||
| num_communities = int(prog_args.dataset[-1]) | |||
| print('Creating dataset with ', num_communities, ' communities') | |||
| c_sizes = np.random.choice([12, 13, 14, 15, 16, 17], num_communities) | |||
| for k in range(3000): | |||
| graphs.append(utils.n_community(c_sizes, p_inter=0.01)) | |||
| X_dataset = [nx.to_numpy_matrix(g) for g in graphs] | |||
| print('Number of graphs: ', len(X_dataset)) | |||
| K = prog_args.K # number of clusters | |||
| gen_graphs = [] | |||
| for i in range(len(X_dataset)): | |||
| if i % 5 == 0: | |||
| print(i) | |||
| X_data = X_dataset[i] | |||
| N = X_data.shape[0] # number of vertices | |||
| Zp, B = mmsb(N, K, X_data) | |||
| #print("Block: ", B) | |||
| Z_pred = Zp.argmax(axis=1) | |||
| print("Result (label flip can happen):") | |||
| #print("prob: ", Zp) | |||
| print("Predicted") | |||
| print(Z_pred) | |||
| #print(Z_true) | |||
| #print("Adjusted Rand Index =", adjusted_rand_score(Z_pred, Z_true)) | |||
| for j in range(prog_args.samples): | |||
| gen_graphs.append(graph_gen_from_blockmodel(B, Zp)) | |||
| save_path = '/lfs/local/0/rexy/graph-generation/eval_results/mmsb/' | |||
| utils.save_graph_list(gen_graphs, os.path.join(save_path, prog_args.dataset + '.dat')) | |||
| @@ -0,0 +1,155 @@ | |||
| import networkx as nx | |||
| import numpy as np | |||
| from utils import * | |||
| from data import * | |||
| def create(args): | |||
| ### load datasets | |||
| graphs=[] | |||
| # synthetic graphs | |||
| if args.graph_type=='ladder': | |||
| graphs = [] | |||
| for i in range(100, 201): | |||
| graphs.append(nx.ladder_graph(i)) | |||
| args.max_prev_node = 10 | |||
| elif args.graph_type=='ladder_small': | |||
| graphs = [] | |||
| for i in range(2, 11): | |||
| graphs.append(nx.ladder_graph(i)) | |||
| args.max_prev_node = 10 | |||
| elif args.graph_type=='tree': | |||
| graphs = [] | |||
| for i in range(2,5): | |||
| for j in range(3,5): | |||
| graphs.append(nx.balanced_tree(i,j)) | |||
| args.max_prev_node = 256 | |||
| elif args.graph_type=='caveman': | |||
| # graphs = [] | |||
| # for i in range(5,10): | |||
| # for j in range(5,25): | |||
| # for k in range(5): | |||
| # graphs.append(nx.relaxed_caveman_graph(i, j, p=0.1)) | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(30, 81): | |||
| for k in range(10): | |||
| graphs.append(caveman_special(i,j, p_edge=0.3)) | |||
| args.max_prev_node = 100 | |||
| elif args.graph_type=='caveman_small': | |||
| # graphs = [] | |||
| # for i in range(2,5): | |||
| # for j in range(2,6): | |||
| # for k in range(10): | |||
| # graphs.append(nx.relaxed_caveman_graph(i, j, p=0.1)) | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(6, 11): | |||
| for k in range(20): | |||
| graphs.append(caveman_special(i, j, p_edge=0.8)) # default 0.8 | |||
| args.max_prev_node = 20 | |||
| elif args.graph_type=='caveman_small_single': | |||
| # graphs = [] | |||
| # for i in range(2,5): | |||
| # for j in range(2,6): | |||
| # for k in range(10): | |||
| # graphs.append(nx.relaxed_caveman_graph(i, j, p=0.1)) | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(8, 9): | |||
| for k in range(100): | |||
| graphs.append(caveman_special(i, j, p_edge=0.5)) | |||
| args.max_prev_node = 20 | |||
| elif args.graph_type.startswith('community'): | |||
| num_communities = int(args.graph_type[-1]) | |||
| print('Creating dataset with ', num_communities, ' communities') | |||
| c_sizes = np.random.choice([12, 13, 14, 15, 16, 17], num_communities) | |||
| #c_sizes = [15] * num_communities | |||
| for k in range(3000): | |||
| graphs.append(n_community(c_sizes, p_inter=0.01)) | |||
| args.max_prev_node = 80 | |||
| elif args.graph_type=='grid': | |||
| graphs = [] | |||
| for i in range(10,20): | |||
| for j in range(10,20): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| args.max_prev_node = 40 | |||
| elif args.graph_type=='grid_small': | |||
| graphs = [] | |||
| for i in range(2,5): | |||
| for j in range(2,6): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| args.max_prev_node = 15 | |||
| elif args.graph_type=='barabasi': | |||
| graphs = [] | |||
| for i in range(100,200): | |||
| for j in range(4,5): | |||
| for k in range(5): | |||
| graphs.append(nx.barabasi_albert_graph(i,j)) | |||
| args.max_prev_node = 130 | |||
| elif args.graph_type=='barabasi_small': | |||
| graphs = [] | |||
| for i in range(4,21): | |||
| for j in range(3,4): | |||
| for k in range(10): | |||
| graphs.append(nx.barabasi_albert_graph(i,j)) | |||
| args.max_prev_node = 20 | |||
| elif args.graph_type=='grid_big': | |||
| graphs = [] | |||
| for i in range(36, 46): | |||
| for j in range(36, 46): | |||
| graphs.append(nx.grid_2d_graph(i, j)) | |||
| args.max_prev_node = 90 | |||
| elif 'barabasi_noise' in args.graph_type: | |||
| graphs = [] | |||
| for i in range(100,101): | |||
| for j in range(4,5): | |||
| for k in range(500): | |||
| graphs.append(nx.barabasi_albert_graph(i,j)) | |||
| graphs = perturb_new(graphs,p=args.noise/10.0) | |||
| args.max_prev_node = 99 | |||
| # real graphs | |||
| elif args.graph_type == 'enzymes': | |||
| graphs= Graph_load_batch(min_num_nodes=10, name='ENZYMES') | |||
| args.max_prev_node = 25 | |||
| elif args.graph_type == 'enzymes_small': | |||
| graphs_raw = Graph_load_batch(min_num_nodes=10, name='ENZYMES') | |||
| graphs = [] | |||
| for G in graphs_raw: | |||
| if G.number_of_nodes()<=20: | |||
| graphs.append(G) | |||
| args.max_prev_node = 15 | |||
| elif args.graph_type == 'protein': | |||
| graphs = Graph_load_batch(min_num_nodes=20, name='PROTEINS_full') | |||
| args.max_prev_node = 80 | |||
| elif args.graph_type == 'DD': | |||
| graphs = Graph_load_batch(min_num_nodes=100, max_num_nodes=500, name='DD',node_attributes=False,graph_labels=True) | |||
| args.max_prev_node = 230 | |||
| elif args.graph_type == 'citeseer': | |||
| _, _, G = Graph_load(dataset='citeseer') | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| G = nx.convert_node_labels_to_integers(G) | |||
| graphs = [] | |||
| for i in range(G.number_of_nodes()): | |||
| G_ego = nx.ego_graph(G, i, radius=3) | |||
| if G_ego.number_of_nodes() >= 50 and (G_ego.number_of_nodes() <= 400): | |||
| graphs.append(G_ego) | |||
| args.max_prev_node = 250 | |||
| elif args.graph_type == 'citeseer_small': | |||
| _, _, G = Graph_load(dataset='citeseer') | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| G = nx.convert_node_labels_to_integers(G) | |||
| graphs = [] | |||
| for i in range(G.number_of_nodes()): | |||
| G_ego = nx.ego_graph(G, i, radius=1) | |||
| if (G_ego.number_of_nodes() >= 4) and (G_ego.number_of_nodes() <= 20): | |||
| graphs.append(G_ego) | |||
| shuffle(graphs) | |||
| graphs = graphs[0:200] | |||
| args.max_prev_node = 15 | |||
| return graphs | |||
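# A minimal usage sketch (illustration only; in the actual pipeline `args` is the
# argument object defined in args.py, and `max_prev_node` is filled in by the
# branches above):
#     from argparse import Namespace
#     args = Namespace(graph_type='grid', max_prev_node=None)
#     graphs = create(args)      # 100 grid graphs; args.max_prev_node is set to 40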
| @@ -0,0 +1,75 @@ | |||
| README for dataset DD | |||
| === Usage === | |||
| This folder contains the following comma separated text files | |||
| (replace DS by the name of the dataset): | |||
| n = total number of nodes | |||
| m = total number of edges | |||
| N = number of graphs | |||
| (1) DS_A.txt (m lines) | |||
| sparse (block diagonal) adjacency matrix for all graphs, | |||
| each line corresponds to (row, col) resp. (node_id, node_id) | |||
| (2) DS_graph_indicator.txt (n lines) | |||
| column vector of graph identifiers for all nodes of all graphs, | |||
| the value in the i-th line is the graph_id of the node with node_id i | |||
| (3) DS_graph_labels.txt (N lines) | |||
| class labels for all graphs in the dataset, | |||
| the value in the i-th line is the class label of the graph with graph_id i | |||
| (4) DS_node_labels.txt (n lines) | |||
| column vector of node labels, | |||
| the value in the i-th line corresponds to the node with node_id i | |||
| There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
| (6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
| attributes for the edges in DS_A.txt | |||
| (7) DS_node_attributes.txt (n lines) | |||
| matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
| (8) DS_graph_attributes.txt (N lines) | |||
| regression values for all graphs in the dataset, | |||
| the value in the i-th line is the attribute of the graph with graph_id i | |||
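A minimal loading sketch for this format (illustration only, assuming Python with
numpy and networkx and files named DD_* in the working directory; the ENZYMES
loading script in this repository follows the same pattern):

    import numpy as np
    import networkx as nx

    edges = np.loadtxt('DD_A.txt', delimiter=',').astype(int)        # (m, 2) edge list
    graph_ind = np.loadtxt('DD_graph_indicator.txt').astype(int)     # graph_id per node
    G_all = nx.Graph()
    G_all.add_edges_from(map(tuple, edges))
    graphs = []
    for gid in range(1, graph_ind.max() + 1):
        nodes = np.where(graph_ind == gid)[0] + 1                    # node_ids are 1-based
        graphs.append(G_all.subgraph(nodes))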
| === Description === | |||
| D&D is a dataset of 1178 protein structures (Dobson and Doig, 2003). Each protein is | |||
| represented by a graph, in which the nodes are amino acids and two nodes are connected | |||
| by an edge if they are less than 6 Angstroms apart. The prediction task is to classify | |||
| the protein structures into enzymes and non-enzymes. | |||
| === Previous Use of the Dataset === | |||
| Neumann, M., Garnett R., Bauckhage Ch., Kersting K.: Propagation Kernels: Efficient Graph | |||
| Kernels from Propagated Information. Under review at MLJ. | |||
| Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by | |||
| Randomization. In: P.A. Flach, T.D. Bie, N. Cristianini (eds.) ECML/PKDD, Notes in | |||
| Computer Science, vol. 7523, pp. 378-393. Springer (2012). | |||
| Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
| Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
| === References === | |||
| P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without | |||
| alignments. J. Mol. Biol., 330(4):771–783, Jul 2003. | |||
| @@ -0,0 +1,600 @@ | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 6 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 5 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 1 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 2 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 3 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| 4 | |||
| @@ -0,0 +1,71 @@ | |||
| README for dataset ENZYMES | |||
| === Usage === | |||
| This folder contains the following comma separated text files | |||
| (replace DS by the name of the dataset): | |||
| n = total number of nodes | |||
| m = total number of edges | |||
| N = number of graphs | |||
| (1) DS_A.txt (m lines) | |||
| sparse (block diagonal) adjacency matrix for all graphs, | |||
| each line corresponds to (row, col) resp. (node_id, node_id) | |||
| (2) DS_graph_indicator.txt (n lines) | |||
| column vector of graph identifiers for all nodes of all graphs, | |||
| the value in the i-th line is the graph_id of the node with node_id i | |||
| (3) DS_graph_labels.txt (N lines) | |||
| class labels for all graphs in the dataset, | |||
| the value in the i-th line is the class label of the graph with graph_id i | |||
| (4) DS_node_labels.txt (n lines) | |||
| column vector of node labels, | |||
| the value in the i-th line corresponds to the node with node_id i | |||
| There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
| (6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
| attributes for the edges in DS_A.txt | |||
| (7) DS_node_attributes.txt (n lines) | |||
| matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
| (8) DS_graph_attributes.txt (N lines) | |||
| regression values for all graphs in the dataset, | |||
| the value in the i-th line is the attribute of the graph with graph_id i | |||
| === Description === | |||
| ENZYMES is a dataset of protein tertiary structures obtained from (Borgwardt et al., 2005) | |||
| consisting of 600 enzymes from the BRENDA enzyme database (Schomburg et al., 2004). | |||
| In this case the task is to correctly assign each enzyme to one of the 6 EC top-level | |||
| classes. | |||
| === Previous Use of the Dataset === | |||
| Feragen, A., Kasenburg, N., Petersen, J., de Bruijne, M., Borgwardt, K.M.: Scalable | |||
| kernels for graphs with continuous attributes. In: C.J.C. Burges, L. Bottou, Z. Ghahra- | |||
| mani, K.Q. Weinberger (eds.) NIPS, pp. 216-224 (2013) | |||
| Neumann, M., Garnett R., Bauckhage Ch., Kersting K.: Propagation Kernels: Efficient Graph | |||
| Kernels from Propagated Information. Under review at MLJ. | |||
| === References === | |||
| K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. | |||
| Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, | |||
| Jun 2005. | |||
| I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. Brenda, | |||
| the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433, 2004. | |||
| @@ -0,0 +1,60 @@ | |||
| import numpy as np | |||
| import networkx as nx | |||
| G = nx.Graph() | |||
| # load data | |||
| data_adj = np.loadtxt('ENZYMES_A.txt', delimiter=',').astype(int) | |||
| data_node_att = np.loadtxt('ENZYMES_node_attributes.txt', delimiter=',') | |||
| data_node_label = np.loadtxt('ENZYMES_node_labels.txt', delimiter=',').astype(int) | |||
| data_graph_indicator = np.loadtxt('ENZYMES_graph_indicator.txt', delimiter=',').astype(int) | |||
| data_graph_labels = np.loadtxt('ENZYMES_graph_labels.txt', delimiter=',').astype(int) | |||
| data_tuple = list(map(tuple, data_adj)) | |||
| print(len(data_tuple)) | |||
| print(data_tuple[0]) | |||
| # add edges | |||
| G.add_edges_from(data_tuple) | |||
| # add node attributes | |||
| for i in range(data_node_att.shape[0]): | |||
| G.add_node(i+1, feature = data_node_att[i]) | |||
| G.add_node(i+1, label = data_node_label[i]) | |||
| G.remove_nodes_from(nx.isolates(G)) | |||
| print(G.number_of_nodes()) | |||
| print(G.number_of_edges()) | |||
| # split into graphs | |||
| graph_num = 600 | |||
| node_list = np.arange(data_graph_indicator.shape[0])+1 | |||
| graphs = [] | |||
| node_num_list = [] | |||
| for i in range(graph_num): | |||
| # find the nodes for each graph | |||
| nodes = node_list[data_graph_indicator==i+1] | |||
| G_sub = G.subgraph(nodes) | |||
| graphs.append(G_sub) | |||
| G_sub.graph['label'] = data_graph_labels[i] | |||
| # print('nodes', G_sub.number_of_nodes()) | |||
| # print('edges', G_sub.number_of_edges()) | |||
| # print('label', G_sub.graph) | |||
| node_num_list.append(G_sub.number_of_nodes()) | |||
| print('average', sum(node_num_list)/len(node_num_list)) | |||
| print('all', len(node_num_list)) | |||
| node_num_list = np.array(node_num_list) | |||
| print('selected', len(node_num_list[node_num_list>10])) | |||
| # print(graphs[0].nodes(data=True)[0][1]['feature']) | |||
| # print(graphs[0].nodes()) | |||
| keys = tuple(graphs[0].nodes()) | |||
| # print(nx.get_node_attributes(graphs[0], 'feature')) | |||
| dictionary = nx.get_node_attributes(graphs[0], 'feature') | |||
| # print('keys', keys) | |||
| # print('keys from dict', list(dictionary.keys())) | |||
| # print('valuse from dict', list(dictionary.values())) | |||
feature_values = list(dictionary.values())
features = np.zeros((len(feature_values), feature_values[0].shape[0]))
for i in range(len(feature_values)):
    features[i, :] = feature_values[i]
| # print(features) | |||
| # print(features.shape) | |||
| @@ -0,0 +1,61 @@ | |||
| README for dataset PROTEINS_full | |||
| === Usage === | |||
| This folder contains the following comma separated text files | |||
| (replace DS by the name of the dataset): | |||
| n = total number of nodes | |||
| m = total number of edges | |||
| N = number of graphs | |||
| (1) DS_A.txt (m lines) | |||
| sparse (block diagonal) adjacency matrix for all graphs, | |||
| each line corresponds to (row, col) resp. (node_id, node_id) | |||
| (2) DS_graph_indicator.txt (n lines) | |||
| column vector of graph identifiers for all nodes of all graphs, | |||
| the value in the i-th line is the graph_id of the node with node_id i | |||
| (3) DS_graph_labels.txt (N lines) | |||
| class labels for all graphs in the dataset, | |||
| the value in the i-th line is the class label of the graph with graph_id i | |||
| (4) DS_node_labels.txt (n lines) | |||
| column vector of node labels, | |||
| the value in the i-th line corresponds to the node with node_id i | |||
| There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
| (6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
| attributes for the edges in DS_A.txt | |||
| (7) DS_node_attributes.txt (n lines) | |||
| matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
| (8) DS_graph_attributes.txt (N lines) | |||
| regression values for all graphs in the dataset, | |||
| the value in the i-th line is the attribute of the graph with graph_id i | |||
| === Previous Use of the Dataset === | |||
| Neumann, M., Garnett R., Bauckhage Ch., Kersting K.: Propagation Kernels: Efficient Graph | |||
| Kernels from Propagated Information. Under review at MLJ. | |||
| === References === | |||
| K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. | |||
| Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, | |||
| Jun 2005. | |||
| P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without | |||
| alignments. J. Mol. Biol., 330(4):771–783, Jul 2003. | |||
| @@ -0,0 +1,267 @@ | |||
| name: root | |||
| channels: | |||
| - soumith | |||
| - conda-forge | |||
| - defaults | |||
| dependencies: | |||
| - conda=4.3.29=py36_0 | |||
| - conda-env=2.6.0=0 | |||
| - gensim=3.0.0=py36_0 | |||
| - smart_open=1.5.3=py36_0 | |||
| - _ipyw_jlab_nb_ext_conf=0.1.0=py36he11e457_0 | |||
| - alabaster=0.7.10=py36h306e16b_0 | |||
| - anaconda=5.0.0=py36h06de3c5_0 | |||
| - anaconda-client=1.6.5=py36h19c0dcd_0 | |||
| - anaconda-navigator=1.6.8=py36h672ccc7_0 | |||
| - anaconda-project=0.8.0=py36h29abdf5_0 | |||
| - asn1crypto=0.22.0=py36h265ca7c_1 | |||
| - astroid=1.5.3=py36hbdb9df2_0 | |||
| - astropy=2.0.2=py36ha51211e_4 | |||
| - babel=2.5.0=py36h7d14adf_0 | |||
| - backports=1.0=py36hfa02d7e_1 | |||
| - backports.shutil_get_terminal_size=1.0.0=py36hfea85ff_2 | |||
| - beautifulsoup4=4.6.0=py36h49b8c8c_1 | |||
| - bitarray=0.8.1=py36h5834eb8_0 | |||
| - bkcharts=0.2=py36h735825a_0 | |||
| - blaze=0.11.3=py36h4e06776_0 | |||
| - bleach=2.0.0=py36h688b259_0 | |||
| - bokeh=0.12.7=py36h169c5fd_1 | |||
| - boto=2.48.0=py36h6e4cd66_1 | |||
| - bottleneck=1.2.1=py36haac1ea0_0 | |||
| - bz2file=0.98=py36_0 | |||
| - ca-certificates=2017.08.26=h1d4fec5_0 | |||
| - cairo=1.14.10=h58b644b_4 | |||
| - certifi=2017.7.27.1=py36h8b7b77e_0 | |||
| - cffi=1.10.0=py36had8d393_1 | |||
| - chardet=3.0.4=py36h0f667ec_1 | |||
| - click=6.7=py36h5253387_0 | |||
| - cloudpickle=0.4.0=py36h30f8c20_0 | |||
| - clyent=1.2.2=py36h7e57e65_1 | |||
| - colorama=0.3.9=py36h489cec4_0 | |||
| - conda-build=3.0.22=py36ha23cd1e_0 | |||
| - conda-verify=2.0.0=py36h98955d8_0 | |||
| - contextlib2=0.5.5=py36h6c84a62_0 | |||
| - cryptography=2.0.3=py36ha225213_1 | |||
| - curl=7.55.1=hcb0b314_2 | |||
| - cycler=0.10.0=py36h93f1223_0 | |||
| - cython=0.26.1=py36h21c49d0_0 | |||
| - cytoolz=0.8.2=py36h708bfd4_0 | |||
| - dask=0.15.2=py36h9b48dc4_0 | |||
| - dask-core=0.15.2=py36h0f988a8_0 | |||
| - datashape=0.5.4=py36h3ad6b5c_0 | |||
| - dbus=1.10.22=h3b5a359_0 | |||
| - decorator=4.1.2=py36hd076ac8_0 | |||
| - distributed=1.18.3=py36h73cd4ae_0 | |||
| - docutils=0.14=py36hb0f60f5_0 | |||
| - entrypoints=0.2.3=py36h1aec115_2 | |||
| - et_xmlfile=1.0.1=py36hd6bccc3_0 | |||
| - expat=2.2.4=hc00ebd1_1 | |||
| - fastcache=1.0.2=py36h5b0c431_0 | |||
| - filelock=2.0.12=py36hacfa1f5_0 | |||
| - flask=0.12.2=py36hb24657c_0 | |||
| - flask-cors=3.0.3=py36h2d857d3_0 | |||
| - fontconfig=2.12.4=h88586e7_1 | |||
| - freetype=2.8=h52ed37b_0 | |||
| - get_terminal_size=1.0.0=haa9412d_0 | |||
| - gevent=1.2.2=py36h2fe25dc_0 | |||
| - glib=2.53.6=hc861d11_1 | |||
| - glob2=0.5=py36h2c1b292_1 | |||
| - gmp=6.1.2=hb3b607b_0 | |||
| - gmpy2=2.0.8=py36h55090d7_1 | |||
| - graphite2=1.3.10=hc526e54_0 | |||
| - greenlet=0.4.12=py36h2d503a6_0 | |||
| - gst-plugins-base=1.12.2=he3457e5_0 | |||
| - gstreamer=1.12.2=h4f93127_0 | |||
| - h5py=2.7.0=py36he81ebca_1 | |||
| - harfbuzz=1.5.0=h2545bd6_0 | |||
| - hdf5=1.10.1=hb0523eb_0 | |||
| - heapdict=1.0.0=py36h79797d7_0 | |||
| - html5lib=0.999999999=py36h2cfc398_0 | |||
| - icu=58.2=h211956c_0 | |||
| - idna=2.6=py36h82fb2a8_1 | |||
| - imageio=2.2.0=py36he555465_0 | |||
| - imagesize=0.7.1=py36h52d8127_0 | |||
| - intel-openmp=2018.0.0=h15fc484_7 | |||
| - ipykernel=4.6.1=py36hbf841aa_0 | |||
| - ipython=6.1.0=py36hc72a948_1 | |||
| - ipython_genutils=0.2.0=py36hb52b0d5_0 | |||
| - ipywidgets=7.0.0=py36h7b55c3a_0 | |||
| - isort=4.2.15=py36had401c0_0 | |||
| - itsdangerous=0.24=py36h93cc618_1 | |||
| - jbig=2.1=hdba287a_0 | |||
| - jdcal=1.3=py36h4c697fb_0 | |||
| - jedi=0.10.2=py36h552def0_0 | |||
| - jinja2=2.9.6=py36h489bce4_1 | |||
| - jpeg=9b=habf39ab_1 | |||
| - jsonschema=2.6.0=py36h006f8b5_0 | |||
| - jupyter=1.0.0=py36h9896ce5_0 | |||
| - jupyter_client=5.1.0=py36h614e9ea_0 | |||
| - jupyter_console=5.2.0=py36he59e554_1 | |||
| - jupyter_core=4.3.0=py36h357a921_0 | |||
| - jupyterlab=0.27.0=py36h86377d0_2 | |||
| - jupyterlab_launcher=0.4.0=py36h4d8058d_0 | |||
| - lazy-object-proxy=1.3.1=py36h10fcdad_0 | |||
| - libedit=3.1=heed3624_0 | |||
| - libffi=3.2.1=h4deb6c0_3 | |||
| - libgcc=7.2.0=h69d50b8_2 | |||
| - libgcc-ng=7.2.0=hcbc56d2_1 | |||
| - libgfortran-ng=7.2.0=h6fcbd8e_1 | |||
| - libpng=1.6.32=hda9c8bc_2 | |||
| - libsodium=1.0.13=h31c71d8_2 | |||
| - libssh2=1.8.0=h8c220ad_2 | |||
| - libstdcxx-ng=7.2.0=h24385c6_1 | |||
| - libtiff=4.0.8=h90200ff_9 | |||
| - libtool=2.4.6=hd50d1a6_0 | |||
| - libxcb=1.12=he6ee5dd_2 | |||
| - libxml2=2.9.4=h6b072ca_5 | |||
| - libxslt=1.1.29=hcf9102b_5 | |||
| - llvmlite=0.20.0=py36_0 | |||
| - locket=0.2.0=py36h787c0ad_1 | |||
| - lxml=3.8.0=py36h6c6e760_0 | |||
| - lzo=2.10=hc0eb8fc_0 | |||
| - markupsafe=1.0=py36hd9260cd_1 | |||
| - matplotlib=2.0.2=py36h2acb4ad_1 | |||
| - mccabe=0.6.1=py36h5ad9710_1 | |||
| - mistune=0.7.4=py36hbab8784_0 | |||
| - mkl=2018.0.0=hb491cac_4 | |||
| - mkl-service=1.1.2=py36h17a0993_4 | |||
| - mpc=1.0.3=hf803216_4 | |||
| - mpfr=3.1.5=h12ff648_1 | |||
| - mpmath=0.19=py36h8cc018b_2 | |||
| - msgpack-python=0.4.8=py36hec4c5d1_0 | |||
| - multipledispatch=0.4.9=py36h41da3fb_0 | |||
| - navigator-updater=0.1.0=py36h14770f7_0 | |||
| - nbconvert=5.3.1=py36hb41ffb7_0 | |||
| - nbformat=4.4.0=py36h31c9010_0 | |||
| - ncurses=6.0=h06874d7_1 | |||
| - networkx=1.11=py36hfb3574a_0 | |||
| - nltk=3.2.4=py36h1a0979f_0 | |||
| - nose=1.3.7=py36hcdf7029_2 | |||
| - notebook=5.0.0=py36h0b20546_2 | |||
| - numba=0.35.0=np113py36_10 | |||
| - numexpr=2.6.2=py36hdd3393f_1 | |||
| - numpy=1.13.1=py36h5bc529a_2 | |||
| - numpydoc=0.7.0=py36h18f165f_0 | |||
| - odo=0.5.1=py36h90ed295_0 | |||
| - olefile=0.44=py36h79f9f78_0 | |||
| - openpyxl=2.4.8=py36h41dd2a8_1 | |||
| - openssl=1.0.2l=h9d1a558_3 | |||
| - packaging=16.8=py36ha668100_1 | |||
| - pandas=0.20.3=py36h842e28d_2 | |||
| - pandoc=1.19.2.1=hea2e7c5_1 | |||
| - pandocfilters=1.4.2=py36ha6701b7_1 | |||
| - pango=1.40.11=hedb6d6b_0 | |||
| - partd=0.3.8=py36h36fd896_0 | |||
| - patchelf=0.9=hf79760b_2 | |||
| - path.py=10.3.1=py36he0c6f6d_0 | |||
| - pathlib2=2.3.0=py36h49efa8e_0 | |||
| - patsy=0.4.1=py36ha3be15e_0 | |||
| - pcre=8.41=hc71a17e_0 | |||
| - pep8=1.7.0=py36h26ade29_0 | |||
| - pexpect=4.2.1=py36h3b9d41b_0 | |||
| - pickleshare=0.7.4=py36h63277f8_0 | |||
| - pillow=4.2.1=py36h9119f52_0 | |||
| - pip=9.0.1=py36h30f8307_2 | |||
| - pixman=0.34.0=ha72d70b_1 | |||
| - pkginfo=1.4.1=py36h215d178_1 | |||
| - ply=3.10=py36hed35086_0 | |||
| - prompt_toolkit=1.0.15=py36h17d85b1_0 | |||
| - psutil=5.2.2=py36h74c8701_0 | |||
| - ptyprocess=0.5.2=py36h69acd42_0 | |||
| - py=1.4.34=py36h0712aa3_1 | |||
| - pycodestyle=2.3.1=py36hf609f19_0 | |||
| - pycosat=0.6.2=py36h1a0ea17_1 | |||
| - pycparser=2.18=py36hf9f622e_1 | |||
| - pycrypto=2.6.1=py36h6998063_1 | |||
| - pycurl=7.43.0=py36h5e72054_3 | |||
| - pyflakes=1.5.0=py36h5510808_1 | |||
| - pygments=2.2.0=py36h0d3125c_0 | |||
| - pylint=1.7.2=py36h484ab97_0 | |||
| - pyodbc=4.0.17=py36h999153c_0 | |||
| - pyopenssl=17.2.0=py36h5cc804b_0 | |||
| - pyparsing=2.2.0=py36hee85983_1 | |||
| - pyqt=5.6.0=py36h0386399_5 | |||
| - pysocks=1.6.7=py36hd97a5b1_1 | |||
| - pytables=3.4.2=py36hdce54c9_1 | |||
| - pytest=3.2.1=py36h11ad3bb_1 | |||
| - python=3.6.2=h02fb82a_12 | |||
| - python-dateutil=2.6.1=py36h88d3b88_1 | |||
| - pytz=2017.2=py36hc2ccc2a_1 | |||
| - pywavelets=0.5.2=py36he602eb0_0 | |||
| - pyyaml=3.12=py36hafb9ca4_1 | |||
| - pyzmq=16.0.2=py36h3b0cf96_2 | |||
| - qt=5.6.2=h974d657_12 | |||
| - qtawesome=0.4.4=py36h609ed8c_0 | |||
| - qtconsole=4.3.1=py36h8f73b5b_0 | |||
| - qtpy=1.3.1=py36h3691cc8_0 | |||
| - readline=7.0=hac23ff0_3 | |||
| - requests=2.18.4=py36he2e5f8d_1 | |||
| - rope=0.10.5=py36h1f8c17e_0 | |||
| - ruamel_yaml=0.11.14=py36ha2fb22d_2 | |||
| - scikit-image=0.13.0=py36had3c07a_1 | |||
| - scikit-learn=0.19.0=py36h97ac459_2 | |||
| - scipy=0.19.1=py36h9976243_3 | |||
| - seaborn=0.8.0=py36h197244f_0 | |||
| - setuptools=36.5.0=py36he42e2e1_0 | |||
| - simplegeneric=0.8.1=py36h2cb9092_0 | |||
| - singledispatch=3.4.0.3=py36h7a266c3_0 | |||
| - sip=4.18.1=py36h51ed4ed_2 | |||
| - six=1.10.0=py36hcac75e4_1 | |||
| - snowballstemmer=1.2.1=py36h6febd40_0 | |||
| - sortedcollections=0.5.3=py36h3c761f9_0 | |||
| - sortedcontainers=1.5.7=py36hdf89491_0 | |||
| - sphinx=1.6.3=py36he5f0bdb_0 | |||
| - sphinxcontrib=1.0=py36h6d0f590_1 | |||
| - sphinxcontrib-websupport=1.0.1=py36hb5cb234_1 | |||
| - spyder=3.2.3=py36he38cbf7_1 | |||
| - sqlalchemy=1.1.13=py36hfb5efd7_0 | |||
| - sqlite=3.20.1=h6d8b0f3_1 | |||
| - statsmodels=0.8.0=py36h8533d0b_0 | |||
| - sympy=1.1.1=py36hc6d1c1c_0 | |||
| - tblib=1.3.2=py36h34cf8b6_0 | |||
| - terminado=0.6=py36ha25a19f_0 | |||
| - testpath=0.3.1=py36h8cadb63_0 | |||
| - tk=8.6.7=h5979e9b_1 | |||
| - toolz=0.8.2=py36h81f2dff_0 | |||
| - tornado=4.5.2=py36h1283b2a_0 | |||
| - traitlets=4.3.2=py36h674d592_0 | |||
| - typing=3.6.2=py36h7da032a_0 | |||
| - unicodecsv=0.14.1=py36ha668878_0 | |||
| - unixodbc=2.3.4=hc36303a_1 | |||
| - urllib3=1.22=py36hbe7ace6_0 | |||
| - wcwidth=0.1.7=py36hdf4376a_0 | |||
| - webencodings=0.5.1=py36h800622e_1 | |||
| - werkzeug=0.12.2=py36hc703753_0 | |||
| - wheel=0.29.0=py36he7f4e38_1 | |||
| - widgetsnbextension=3.0.2=py36hd01bb71_1 | |||
| - wrapt=1.10.11=py36h28b7045_0 | |||
| - xlrd=1.1.0=py36h1db9f0c_1 | |||
| - xlsxwriter=0.9.8=py36hf41c223_0 | |||
| - xlwt=1.3.0=py36h7b00a1f_0 | |||
| - xz=5.2.3=h2bcbf08_1 | |||
| - yaml=0.1.7=h96e3832_1 | |||
| - zeromq=4.2.2=hb0b69da_1 | |||
| - zict=0.1.2=py36ha0d441b_0 | |||
| - zlib=1.2.11=hfbfcf68_1 | |||
| - cuda80=1.0=0 | |||
| - pytorch=0.2.0=py36h53baedd_4cu80 | |||
| - torchvision=0.1.9=py36h7584368_1 | |||
| - pip: | |||
| - backports.shutil-get-terminal-size==1.0.0 | |||
| - et-xmlfile==1.0.1 | |||
| - gae==0.0.1 | |||
| - ipython-genutils==0.2.0 | |||
| - jupyter-client==5.1.0 | |||
| - jupyter-console==5.2.0 | |||
| - jupyter-core==4.3.0 | |||
| - jupyterlab-launcher==0.4.0 | |||
| - prompt-toolkit==1.0.15 | |||
| - protobuf==3.4.0 | |||
| - python-louvain==0.9 | |||
| - ruamel-yaml==0.11.14 | |||
| - smart-open==1.5.3 | |||
| - tables==3.4.2 | |||
| - tensorboard-logger==0.0.4 | |||
| - torch==0.2.0.post4 | |||
| prefix: /lfs/hyperion/0/jiaxuany/anaconda3 | |||
| @@ -0,0 +1,2 @@ | |||
| include orca/orca.h | |||
| @@ -0,0 +1,135 @@ | |||
| import concurrent.futures | |||
| from functools import partial | |||
| import networkx as nx | |||
| import numpy as np | |||
| from scipy.linalg import toeplitz | |||
| import pyemd | |||
| def emd(x, y, distance_scaling=1.0): | |||
| support_size = max(len(x), len(y)) | |||
| d_mat = toeplitz(range(support_size)).astype(np.float) | |||
| distance_mat = d_mat / distance_scaling | |||
| # convert histogram values x and y to float, and make them equal len | |||
| x = x.astype(np.float) | |||
| y = y.astype(np.float) | |||
| if len(x) < len(y): | |||
| x = np.hstack((x, [0.0] * (support_size - len(x)))) | |||
| elif len(y) < len(x): | |||
| y = np.hstack((y, [0.0] * (support_size - len(y)))) | |||
| emd = pyemd.emd(x, y, distance_mat) | |||
| return emd | |||
| def l2(x, y): | |||
| dist = np.linalg.norm(x - y, 2) | |||
| return dist | |||
| def gaussian_emd(x, y, sigma=1.0, distance_scaling=1.0): | |||
| ''' Gaussian kernel with squared distance in exponential term replaced by EMD | |||
| Args: | |||
| x, y: 1D pmf of two distributions with the same support | |||
| sigma: standard deviation | |||
| ''' | |||
| support_size = max(len(x), len(y)) | |||
| d_mat = toeplitz(range(support_size)).astype(np.float) | |||
| distance_mat = d_mat / distance_scaling | |||
| # convert histogram values x and y to float, and make them equal len | |||
| x = x.astype(np.float) | |||
| y = y.astype(np.float) | |||
| if len(x) < len(y): | |||
| x = np.hstack((x, [0.0] * (support_size - len(x)))) | |||
| elif len(y) < len(x): | |||
| y = np.hstack((y, [0.0] * (support_size - len(y)))) | |||
| emd = pyemd.emd(x, y, distance_mat) | |||
| return np.exp(-emd * emd / (2 * sigma * sigma)) | |||
| def gaussian(x, y, sigma=1.0): | |||
| dist = np.linalg.norm(x - y, 2) | |||
| return np.exp(-dist * dist / (2 * sigma * sigma)) | |||
| def kernel_parallel_unpacked(x, samples2, kernel): | |||
| d = 0 | |||
| for s2 in samples2: | |||
| d += kernel(x, s2) | |||
| return d | |||
| def kernel_parallel_worker(t): | |||
| return kernel_parallel_unpacked(*t) | |||
| def disc(samples1, samples2, kernel, is_parallel=True, *args, **kwargs): | |||
| ''' Discrepancy between 2 samples | |||
| ''' | |||
| d = 0 | |||
| if not is_parallel: | |||
| for s1 in samples1: | |||
| for s2 in samples2: | |||
| d += kernel(s1, s2, *args, **kwargs) | |||
| else: | |||
| with concurrent.futures.ProcessPoolExecutor() as executor: | |||
| for dist in executor.map(kernel_parallel_worker, | |||
| [(s1, samples2, partial(kernel, *args, **kwargs)) for s1 in samples1]): | |||
| d += dist | |||
| d /= len(samples1) * len(samples2) | |||
| return d | |||
| def compute_mmd(samples1, samples2, kernel, is_hist=True, *args, **kwargs): | |||
| ''' MMD between two samples | |||
| ''' | |||
| # normalize histograms into pmf | |||
| if is_hist: | |||
| samples1 = [s1 / np.sum(s1) for s1 in samples1] | |||
| samples2 = [s2 / np.sum(s2) for s2 in samples2] | |||
| # print('===============================') | |||
| # print('s1: ', disc(samples1, samples1, kernel, *args, **kwargs)) | |||
| # print('--------------------------') | |||
| # print('s2: ', disc(samples2, samples2, kernel, *args, **kwargs)) | |||
| # print('--------------------------') | |||
| # print('cross: ', disc(samples1, samples2, kernel, *args, **kwargs)) | |||
| # print('===============================') | |||
| return disc(samples1, samples1, kernel, *args, **kwargs) + \ | |||
| disc(samples2, samples2, kernel, *args, **kwargs) - \ | |||
| 2 * disc(samples1, samples2, kernel, *args, **kwargs) | |||
| def compute_emd(samples1, samples2, kernel, is_hist=True, *args, **kwargs): | |||
| ''' EMD between average of two samples | |||
| ''' | |||
| # normalize histograms into pmf | |||
| if is_hist: | |||
| samples1 = [np.mean(samples1)] | |||
| samples2 = [np.mean(samples2)] | |||
| # print('===============================') | |||
| # print('s1: ', disc(samples1, samples1, kernel, *args, **kwargs)) | |||
| # print('--------------------------') | |||
| # print('s2: ', disc(samples2, samples2, kernel, *args, **kwargs)) | |||
| # print('--------------------------') | |||
| # print('cross: ', disc(samples1, samples2, kernel, *args, **kwargs)) | |||
| # print('===============================') | |||
| return disc(samples1, samples2, kernel, *args, **kwargs),[samples1[0],samples2[0]] | |||
| def test(): | |||
| s1 = np.array([0.2, 0.8]) | |||
| s2 = np.array([0.3, 0.7]) | |||
| samples1 = [s1, s2] | |||
| s3 = np.array([0.25, 0.75]) | |||
| s4 = np.array([0.35, 0.65]) | |||
| samples2 = [s3, s4] | |||
| s5 = np.array([0.8, 0.2]) | |||
| s6 = np.array([0.7, 0.3]) | |||
| samples3 = [s5, s6] | |||
| print('between samples1 and samples2: ', compute_mmd(samples1, samples2, kernel=gaussian_emd, | |||
| is_parallel=False, sigma=1.0)) | |||
| print('between samples1 and samples3: ', compute_mmd(samples1, samples3, kernel=gaussian_emd, | |||
| is_parallel=False, sigma=1.0)) | |||
| if __name__ == '__main__': | |||
| test() | |||
| @@ -0,0 +1,6 @@ | |||
| 4 4 | |||
| 0 1 | |||
| 1 2 | |||
| 2 3 | |||
| 3 0 | |||
| @@ -0,0 +1,69 @@ | |||
| #include <cstdio> | |||
| #include <cstdlib> | |||
| #include <cstring> | |||
| #include <Python.h> | |||
| #include "orca/orca.h" | |||
| static PyObject * | |||
| orca_motifs(PyObject *self, PyObject *args) | |||
| { | |||
| const char *orbit_type; | |||
| int graphlet_size; | |||
| const char *input_filename; | |||
| const char *output_filename; | |||
if (!PyArg_ParseTuple(args, "siss", &orbit_type, &graphlet_size, &input_filename, &output_filename))
    return NULL;
/* run the ORCA orbit counting routine declared in orca/orca.h */
motif_counts(orbit_type, graphlet_size, input_filename, output_filename);
return PyLong_FromLong(0);
| } | |||
static PyMethodDef OrcaMethods[] = {
    {"motifs", orca_motifs, METH_VARARGS,
     "Compute motif counts."},
    {NULL, NULL, 0, NULL}        /* sentinel required by the CPython API */
};
| static struct PyModuleDef orcamodule = { | |||
| PyModuleDef_HEAD_INIT, | |||
| "orca", /* name of module */ | |||
| NULL, /* module documentation, may be NULL */ | |||
| -1, /* size of per-interpreter state of the module, | |||
| or -1 if the module keeps state in global variables. */ | |||
| OrcaMethods | |||
| }; | |||
| PyMODINIT_FUNC | |||
| PyInit_orca(void) | |||
| { | |||
| return PyModule_Create(&orcamodule); | |||
| } | |||
| int main(int argc, char *argv[]) { | |||
| wchar_t *program = Py_DecodeLocale(argv[0], NULL); | |||
| if (program == NULL) { | |||
| fprintf(stderr, "Fatal error: cannot decode argv[0]\n"); | |||
| exit(1); | |||
| } | |||
| /* Add a built-in module, before Py_Initialize */ | |||
| PyImport_AppendInittab("orca", PyInit_orca); | |||
| /* Pass argv[0] to the Python interpreter */ | |||
| Py_SetProgramName(program); | |||
| /* Initialize the Python interpreter. Required. */ | |||
| Py_Initialize(); | |||
| /* Optionally import the module; alternatively, | |||
| import can be deferred until the embedded script | |||
| imports it. */ | |||
| PyImport_ImportModule("orca"); | |||
| PyMem_RawFree(program); | |||
| } | |||
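/*
 * Python-side usage sketch (illustration only; the file names are placeholders):
 *     import orca
 *     orca.motifs("node", 4, "graph_in.txt", "orbit_counts_out.txt")
 * The arguments mirror the "siss" tuple parsed above: orbit type, graphlet size,
 * input edge-list file, output file.
 */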
| @@ -0,0 +1,11 @@ | |||
| from distutils.core import setup, Extension | |||
| orca_module = Extension('orca', | |||
| sources = ['orcamodule.cpp'], | |||
| extra_compile_args=['-std=c++11'],) | |||
| setup (name = 'orca', | |||
| version = '1.0', | |||
| description = 'ORCA motif counting package', | |||
| ext_modules = [orca_module]) | |||
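# A typical in-place build of this extension (standard distutils usage, shown
# here only as a hint):
#     python setup.py build_ext --inplace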
| @@ -0,0 +1,233 @@ | |||
| import concurrent.futures | |||
| from datetime import datetime | |||
| from functools import partial | |||
| import numpy as np | |||
| import networkx as nx | |||
| import os | |||
| import pickle as pkl | |||
| import subprocess as sp | |||
| import time | |||
| import eval.mmd as mmd | |||
| PRINT_TIME = False | |||
| def degree_worker(G): | |||
| return np.array(nx.degree_histogram(G)) | |||
| def add_tensor(x,y): | |||
| support_size = max(len(x), len(y)) | |||
| if len(x) < len(y): | |||
| x = np.hstack((x, [0.0] * (support_size - len(x)))) | |||
| elif len(y) < len(x): | |||
| y = np.hstack((y, [0.0] * (support_size - len(y)))) | |||
| return x+y | |||
| def degree_stats(graph_ref_list, graph_pred_list, is_parallel=False): | |||
| ''' Compute the distance between the degree distributions of two unordered sets of graphs. | |||
| Args: | |||
graph_ref_list, graph_pred_list: two lists of networkx graphs to be evaluated
| ''' | |||
| sample_ref = [] | |||
| sample_pred = [] | |||
| # in case an empty graph is generated | |||
| graph_pred_list_remove_empty = [G for G in graph_pred_list if not G.number_of_nodes() == 0] | |||
| prev = datetime.now() | |||
| if is_parallel: | |||
| with concurrent.futures.ProcessPoolExecutor() as executor: | |||
| for deg_hist in executor.map(degree_worker, graph_ref_list): | |||
| sample_ref.append(deg_hist) | |||
| with concurrent.futures.ProcessPoolExecutor() as executor: | |||
| for deg_hist in executor.map(degree_worker, graph_pred_list_remove_empty): | |||
| sample_pred.append(deg_hist) | |||
| else: | |||
| for i in range(len(graph_ref_list)): | |||
| degree_temp = np.array(nx.degree_histogram(graph_ref_list[i])) | |||
| sample_ref.append(degree_temp) | |||
| for i in range(len(graph_pred_list_remove_empty)): | |||
| degree_temp = np.array(nx.degree_histogram(graph_pred_list_remove_empty[i])) | |||
| sample_pred.append(degree_temp) | |||
| print(len(sample_ref),len(sample_pred)) | |||
| mmd_dist = mmd.compute_mmd(sample_ref, sample_pred, kernel=mmd.gaussian_emd) | |||
| elapsed = datetime.now() - prev | |||
| if PRINT_TIME: | |||
| print('Time computing degree mmd: ', elapsed) | |||
| return mmd_dist | |||
| def clustering_worker(param): | |||
| G, bins = param | |||
| clustering_coeffs_list = list(nx.clustering(G).values()) | |||
| hist, _ = np.histogram( | |||
| clustering_coeffs_list, bins=bins, range=(0.0, 1.0), density=False) | |||
| return hist | |||
| def clustering_stats(graph_ref_list, graph_pred_list, bins=100, is_parallel=True): | |||
| sample_ref = [] | |||
| sample_pred = [] | |||
| graph_pred_list_remove_empty = [G for G in graph_pred_list if not G.number_of_nodes() == 0] | |||
| prev = datetime.now() | |||
| if is_parallel: | |||
| with concurrent.futures.ProcessPoolExecutor() as executor: | |||
| for clustering_hist in executor.map(clustering_worker, | |||
| [(G, bins) for G in graph_ref_list]): | |||
| sample_ref.append(clustering_hist) | |||
| with concurrent.futures.ProcessPoolExecutor() as executor: | |||
| for clustering_hist in executor.map(clustering_worker, | |||
| [(G, bins) for G in graph_pred_list_remove_empty]): | |||
| sample_pred.append(clustering_hist) | |||
| # check non-zero elements in hist | |||
| #total = 0 | |||
| #for i in range(len(sample_pred)): | |||
| # nz = np.nonzero(sample_pred[i])[0].shape[0] | |||
| # total += nz | |||
| #print(total) | |||
| else: | |||
| for i in range(len(graph_ref_list)): | |||
| clustering_coeffs_list = list(nx.clustering(graph_ref_list[i]).values()) | |||
| hist, _ = np.histogram( | |||
| clustering_coeffs_list, bins=bins, range=(0.0, 1.0), density=False) | |||
| sample_ref.append(hist) | |||
| for i in range(len(graph_pred_list_remove_empty)): | |||
| clustering_coeffs_list = list(nx.clustering(graph_pred_list_remove_empty[i]).values()) | |||
| hist, _ = np.histogram( | |||
| clustering_coeffs_list, bins=bins, range=(0.0, 1.0), density=False) | |||
| sample_pred.append(hist) | |||
| mmd_dist = mmd.compute_mmd(sample_ref, sample_pred, kernel=mmd.gaussian_emd, | |||
| sigma=1.0/10, distance_scaling=bins) | |||
| elapsed = datetime.now() - prev | |||
| if PRINT_TIME: | |||
| print('Time computing clustering mmd: ', elapsed) | |||
| return mmd_dist | |||
| # maps motif/orbit name string to its corresponding list of indices from orca output | |||
| motif_to_indices = { | |||
| '3path' : [1, 2], | |||
| '4cycle' : [8], | |||
| } | |||
| COUNT_START_STR = 'orbit counts: \n' | |||
| def edge_list_reindexed(G): | |||
| idx = 0 | |||
| id2idx = dict() | |||
| for u in G.nodes(): | |||
| id2idx[str(u)] = idx | |||
| idx += 1 | |||
| edges = [] | |||
| for (u, v) in G.edges(): | |||
| edges.append((id2idx[str(u)], id2idx[str(v)])) | |||
| return edges | |||
| def orca(graph): | |||
| tmp_fname = 'eval/orca/tmp.txt' | |||
| f = open(tmp_fname, 'w') | |||
| f.write(str(graph.number_of_nodes()) + ' ' + str(graph.number_of_edges()) + '\n') | |||
| for (u, v) in edge_list_reindexed(graph): | |||
| f.write(str(u) + ' ' + str(v) + '\n') | |||
| f.close() | |||
| output = sp.check_output(['./eval/orca/orca', 'node', '4', 'eval/orca/tmp.txt', 'std']) | |||
| output = output.decode('utf8').strip() | |||
| idx = output.find(COUNT_START_STR) + len(COUNT_START_STR) | |||
| output = output[idx:] | |||
| node_orbit_counts = np.array([list(map(int, node_cnts.strip().split(' ') )) | |||
| for node_cnts in output.strip('\n').split('\n')]) | |||
| try: | |||
| os.remove(tmp_fname) | |||
| except OSError: | |||
| pass | |||
| return node_orbit_counts | |||
| def motif_stats(graph_ref_list, graph_pred_list, motif_type='4cycle', ground_truth_match=None, bins=100): | |||
| # graph motif counts (int for each graph) | |||
| # normalized by graph size | |||
| total_counts_ref = [] | |||
| total_counts_pred = [] | |||
| num_matches_ref = [] | |||
| num_matches_pred = [] | |||
| graph_pred_list_remove_empty = [G for G in graph_pred_list if not G.number_of_nodes() == 0] | |||
| indices = motif_to_indices[motif_type] | |||
| for G in graph_ref_list: | |||
| orbit_counts = orca(G) | |||
| motif_counts = np.sum(orbit_counts[:, indices], axis=1) | |||
| if ground_truth_match is not None: | |||
| match_cnt = 0 | |||
| for elem in motif_counts: | |||
| if elem == ground_truth_match: | |||
| match_cnt += 1 | |||
| num_matches_ref.append(match_cnt / G.number_of_nodes()) | |||
| #hist, _ = np.histogram( | |||
| # motif_counts, bins=bins, density=False) | |||
| motif_temp = np.sum(motif_counts) / G.number_of_nodes() | |||
| total_counts_ref.append(motif_temp) | |||
| for G in graph_pred_list_remove_empty: | |||
| orbit_counts = orca(G) | |||
| motif_counts = np.sum(orbit_counts[:, indices], axis=1) | |||
| if ground_truth_match is not None: | |||
| match_cnt = 0 | |||
| for elem in motif_counts: | |||
| if elem == ground_truth_match: | |||
| match_cnt += 1 | |||
| num_matches_pred.append(match_cnt / G.number_of_nodes()) | |||
| motif_temp = np.sum(motif_counts) / G.number_of_nodes() | |||
| total_counts_pred.append(motif_temp) | |||
| mmd_dist = mmd.compute_mmd(total_counts_ref, total_counts_pred, kernel=mmd.gaussian, | |||
| is_hist=False) | |||
| #print('-------------------------') | |||
| #print(np.sum(total_counts_ref) / len(total_counts_ref)) | |||
| #print('...') | |||
| #print(np.sum(total_counts_pred) / len(total_counts_pred)) | |||
| #print('-------------------------') | |||
| return mmd_dist | |||
| def orbit_stats_all(graph_ref_list, graph_pred_list): | |||
| total_counts_ref = [] | |||
| total_counts_pred = [] | |||
| graph_pred_list_remove_empty = [G for G in graph_pred_list if not G.number_of_nodes() == 0] | |||
| for G in graph_ref_list: | |||
| try: | |||
| orbit_counts = orca(G) | |||
| except: | |||
| continue | |||
| orbit_counts_graph = np.sum(orbit_counts, axis=0) / G.number_of_nodes() | |||
| total_counts_ref.append(orbit_counts_graph) | |||
for G in graph_pred_list_remove_empty:
| try: | |||
| orbit_counts = orca(G) | |||
| except: | |||
| continue | |||
| orbit_counts_graph = np.sum(orbit_counts, axis=0) / G.number_of_nodes() | |||
| total_counts_pred.append(orbit_counts_graph) | |||
| total_counts_ref = np.array(total_counts_ref) | |||
| total_counts_pred = np.array(total_counts_pred) | |||
| mmd_dist = mmd.compute_mmd(total_counts_ref, total_counts_pred, kernel=mmd.gaussian, | |||
| is_hist=False, sigma=30.0) | |||
| print('-------------------------') | |||
| print(np.sum(total_counts_ref, axis=0) / len(total_counts_ref)) | |||
| print('...') | |||
| print(np.sum(total_counts_pred, axis=0) / len(total_counts_pred)) | |||
| print('-------------------------') | |||
| return mmd_dist | |||
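# Minimal usage sketch (illustration only): compare two small graph lists with the
# degree and clustering MMD statistics defined above.
#     import networkx as nx
#     ref = [nx.grid_2d_graph(5, 5) for _ in range(8)]
#     pred = [nx.barabasi_albert_graph(25, 2) for _ in range(8)]
#     print('degree MMD:', degree_stats(ref, pred))
#     print('clustering MMD:', clustering_stats(ref, pred, bins=100))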
| @@ -0,0 +1,692 @@ | |||
| import argparse | |||
| import numpy as np | |||
| import os | |||
import logging
import re
| from random import shuffle | |||
| import eval.stats | |||
| import utils | |||
| # import main.Args | |||
| from baselines.baseline_simple import * | |||
| class Args_evaluate(): | |||
| def __init__(self): | |||
| # loop over the settings | |||
| # self.model_name_all = ['GraphRNN_MLP','GraphRNN_RNN','Internal','Noise'] | |||
| # self.model_name_all = ['E-R', 'B-A'] | |||
| self.model_name_all = ['GraphRNN_RNN'] | |||
| # self.model_name_all = ['Baseline_DGMG'] | |||
| # list of dataset to evaluate | |||
| # use a list of 1 element to evaluate a single dataset | |||
| self.dataset_name_all = ['caveman', 'grid', 'barabasi', 'citeseer', 'DD'] | |||
| # self.dataset_name_all = ['citeseer_small','caveman_small'] | |||
| # self.dataset_name_all = ['barabasi_noise0','barabasi_noise2','barabasi_noise4','barabasi_noise6','barabasi_noise8','barabasi_noise10'] | |||
| # self.dataset_name_all = ['caveman_small', 'ladder_small', 'grid_small', 'ladder_small', 'enzymes_small', 'barabasi_small','citeseer_small'] | |||
| self.epoch_start=100 | |||
| self.epoch_end=3001 | |||
| self.epoch_step=100 | |||
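# For example, to evaluate only the MLP variant on the grid dataset, one would
# set (illustration only):
#     self.model_name_all = ['GraphRNN_MLP']
#     self.dataset_name_all = ['grid']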
| def find_nearest_idx(array,value): | |||
| idx = (np.abs(array-value)).argmin() | |||
| return idx | |||
| def extract_result_id_and_epoch(name, prefix, suffix): | |||
'''
Args:
    name: result filename to parse
    prefix: substring that directly precedes the epoch number in the filename
    suffix: 'real_' or 'pred_'
Returns:
    A tuple of (id, epoch number) extracted from the filename string
'''
| pos = name.find(suffix) + len(suffix) | |||
| end_pos = name.find('.dat') | |||
| result_id = name[pos:end_pos] | |||
| pos = name.find(prefix) + len(prefix) | |||
| end_pos = name.find('_', pos) | |||
| epochs = int(name[pos:end_pos]) | |||
| return result_id, epochs | |||
| def eval_list(real_graphs_filename, pred_graphs_filename, prefix, eval_every): | |||
| real_graphs_dict = {} | |||
| pred_graphs_dict = {} | |||
| for fname in real_graphs_filename: | |||
| result_id, epochs = extract_result_id_and_epoch(fname, prefix, 'real_') | |||
| if not epochs % eval_every == 0: | |||
| continue | |||
| if result_id not in real_graphs_dict: | |||
| real_graphs_dict[result_id] = {} | |||
| real_graphs_dict[result_id][epochs] = fname | |||
| for fname in pred_graphs_filename: | |||
| result_id, epochs = extract_result_id_and_epoch(fname, prefix, 'pred_') | |||
| if not epochs % eval_every == 0: | |||
| continue | |||
| if result_id not in pred_graphs_dict: | |||
| pred_graphs_dict[result_id] = {} | |||
| pred_graphs_dict[result_id][epochs] = fname | |||
| for result_id in real_graphs_dict.keys(): | |||
| for epochs in sorted(real_graphs_dict[result_id]): | |||
| real_g_list = utils.load_graph_list(real_graphs_dict[result_id][epochs]) | |||
| pred_g_list = utils.load_graph_list(pred_graphs_dict[result_id][epochs]) | |||
| shuffle(real_g_list) | |||
| shuffle(pred_g_list) | |||
| perturbed_g_list = perturb(real_g_list, 0.05) | |||
| #dist = eval.stats.degree_stats(real_g_list, pred_g_list) | |||
| dist = eval.stats.clustering_stats(real_g_list, pred_g_list) | |||
| print('dist between real and pred (', result_id, ') at epoch ', epochs, ': ', dist) | |||
| #dist = eval.stats.degree_stats(real_g_list, perturbed_g_list) | |||
| dist = eval.stats.clustering_stats(real_g_list, perturbed_g_list) | |||
| print('dist between real and perturbed: ', dist) | |||
| mid = len(real_g_list) // 2 | |||
| #dist = eval.stats.degree_stats(real_g_list[:mid], real_g_list[mid:]) | |||
| dist = eval.stats.clustering_stats(real_g_list[:mid], real_g_list[mid:]) | |||
| print('dist among real: ', dist) | |||
| def compute_basic_stats(real_g_list, target_g_list): | |||
| dist_degree = eval.stats.degree_stats(real_g_list, target_g_list) | |||
| dist_clustering = eval.stats.clustering_stats(real_g_list, target_g_list) | |||
| return dist_degree, dist_clustering | |||
| def clean_graphs(graph_real, graph_pred): | |||
''' Select generated graphs whose sizes are similar to those of the real graphs.
This is usually necessary for the GraphRNN-S version, but not for the full GraphRNN model.
'''
| shuffle(graph_real) | |||
| shuffle(graph_pred) | |||
| # get length | |||
| real_graph_len = np.array([len(graph_real[i]) for i in range(len(graph_real))]) | |||
| pred_graph_len = np.array([len(graph_pred[i]) for i in range(len(graph_pred))]) | |||
| # select pred samples | |||
# so that the number of nodes follows a distribution similar to the training set
| pred_graph_new = [] | |||
| pred_graph_len_new = [] | |||
| for value in real_graph_len: | |||
| pred_idx = find_nearest_idx(pred_graph_len, value) | |||
| pred_graph_new.append(graph_pred[pred_idx]) | |||
| pred_graph_len_new.append(pred_graph_len[pred_idx]) | |||
| return graph_real, pred_graph_new | |||
| def load_ground_truth(dir_input, dataset_name, model_name='GraphRNN_RNN'): | |||
| ''' Read ground truth graphs. | |||
| ''' | |||
| if not 'small' in dataset_name: | |||
| hidden = 128 | |||
| else: | |||
| hidden = 64 | |||
| if model_name=='Internal' or model_name=='Noise' or model_name=='B-A' or model_name=='E-R': | |||
| fname_test = dir_input + 'GraphRNN_MLP' + '_' + dataset_name + '_' + str(args.num_layers) + '_' + str( | |||
| hidden) + '_test_' + str(0) + '.dat' | |||
| else: | |||
| fname_test = dir_input + model_name + '_' + dataset_name + '_' + str(args.num_layers) + '_' + str( | |||
| hidden) + '_test_' + str(0) + '.dat' | |||
| try: | |||
| graph_test = utils.load_graph_list(fname_test,is_real=True) | |||
| except: | |||
| print('Not found: ' + fname_test) | |||
| logging.warning('Not found: ' + fname_test) | |||
| return None | |||
| return graph_test | |||
| def eval_single_list(graphs, dir_input, dataset_name): | |||
| ''' Evaluate a list of graphs by comparing with graphs in directory dir_input. | |||
| Args: | |||
| dir_input: directory where ground truth graph list is stored | |||
| dataset_name: name of the dataset (ground truth) | |||
| ''' | |||
| graph_test = load_ground_truth(dir_input, dataset_name) | |||
| graph_test_len = len(graph_test) | |||
| graph_test = graph_test[int(0.8 * graph_test_len):] # test on a hold out test set | |||
| mmd_degree = eval.stats.degree_stats(graph_test, graphs) | |||
| mmd_clustering = eval.stats.clustering_stats(graph_test, graphs) | |||
| try: | |||
| mmd_4orbits = eval.stats.orbit_stats_all(graph_test, graphs) | |||
| except: | |||
| mmd_4orbits = -1 | |||
| print('deg: ', mmd_degree) | |||
| print('clustering: ', mmd_clustering) | |||
| print('orbits: ', mmd_4orbits) | |||
| def evaluation_epoch(dir_input, fname_output, model_name, dataset_name, args, is_clean=True, epoch_start=1000,epoch_end=3001,epoch_step=100): | |||
| with open(fname_output, 'w+') as f: | |||
| f.write('sample_time,epoch,degree_validate,clustering_validate,orbits4_validate,degree_test,clustering_test,orbits4_test\n') | |||
| # TODO: Maybe refactor into a separate file/function that specifies THE naming convention | |||
| # across main and evaluate | |||
| if not 'small' in dataset_name: | |||
| hidden = 128 | |||
| else: | |||
| hidden = 64 | |||
| # read real graph | |||
| if model_name=='Internal' or model_name=='Noise' or model_name=='B-A' or model_name=='E-R': | |||
| fname_test = dir_input + 'GraphRNN_MLP' + '_' + dataset_name + '_' + str(args.num_layers) + '_' + str( | |||
| hidden) + '_test_' + str(0) + '.dat' | |||
| elif 'Baseline' in model_name: | |||
| fname_test = dir_input + model_name + '_' + dataset_name + '_' + str(64) + '_test_' + str(0) + '.dat' | |||
| else: | |||
| fname_test = dir_input + model_name + '_' + dataset_name + '_' + str(args.num_layers) + '_' + str( | |||
| hidden) + '_test_' + str(0) + '.dat' | |||
| try: | |||
| graph_test = utils.load_graph_list(fname_test,is_real=True) | |||
| except: | |||
| print('Not found: ' + fname_test) | |||
| logging.warning('Not found: ' + fname_test) | |||
| return None | |||
| graph_test_len = len(graph_test) | |||
| graph_train = graph_test[0:int(0.8 * graph_test_len)] # train | |||
| graph_validate = graph_test[0:int(0.2 * graph_test_len)] # validate | |||
| graph_test = graph_test[int(0.8 * graph_test_len):] # test on a hold out test set | |||
| graph_test_aver = 0 | |||
| for graph in graph_test: | |||
| graph_test_aver+=graph.number_of_nodes() | |||
| graph_test_aver /= len(graph_test) | |||
| print('test average len',graph_test_aver) | |||
| # get performance for proposed approaches | |||
| if 'GraphRNN' in model_name: | |||
| # read test graph | |||
| for epoch in range(epoch_start,epoch_end,epoch_step): | |||
| for sample_time in range(1,4): | |||
| # get filename | |||
| fname_pred = dir_input + model_name + '_' + dataset_name + '_' + str(args.num_layers) + '_' + str(hidden) + '_pred_' + str(epoch) + '_' + str(sample_time) + '.dat' | |||
| # load graphs | |||
| try: | |||
| graph_pred = utils.load_graph_list(fname_pred,is_real=False) # default False | |||
| except: | |||
| print('Not found: '+ fname_pred) | |||
| logging.warning('Not found: '+ fname_pred) | |||
| continue | |||
| # clean graphs | |||
| if is_clean: | |||
| graph_test, graph_pred = clean_graphs(graph_test, graph_pred) | |||
| else: | |||
| shuffle(graph_pred) | |||
| graph_pred = graph_pred[0:len(graph_test)] | |||
| print('len graph_test', len(graph_test)) | |||
| print('len graph_validate', len(graph_validate)) | |||
| print('len graph_pred', len(graph_pred)) | |||
| graph_pred_aver = 0 | |||
| for graph in graph_pred: | |||
| graph_pred_aver += graph.number_of_nodes() | |||
| graph_pred_aver /= len(graph_pred) | |||
| print('pred average len', graph_pred_aver) | |||
| # evaluate MMD test | |||
| mmd_degree = eval.stats.degree_stats(graph_test, graph_pred) | |||
| mmd_clustering = eval.stats.clustering_stats(graph_test, graph_pred) | |||
| try: | |||
| mmd_4orbits = eval.stats.orbit_stats_all(graph_test, graph_pred) | |||
| except: | |||
| mmd_4orbits = -1 | |||
| # evaluate MMD validate | |||
| mmd_degree_validate = eval.stats.degree_stats(graph_validate, graph_pred) | |||
| mmd_clustering_validate = eval.stats.clustering_stats(graph_validate, graph_pred) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_validate, graph_pred) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| # write results | |||
| f.write(str(sample_time)+','+ | |||
| str(epoch)+','+ | |||
| str(mmd_degree_validate)+','+ | |||
| str(mmd_clustering_validate)+','+ | |||
| str(mmd_4orbits_validate)+','+ | |||
| str(mmd_degree)+','+ | |||
| str(mmd_clustering)+','+ | |||
| str(mmd_4orbits)+'\n') | |||
| print('degree',mmd_degree,'clustering',mmd_clustering,'orbits',mmd_4orbits) | |||
| # get internal MMD (MMD between ground truth validation and test sets) | |||
| if model_name == 'Internal': | |||
| mmd_degree_validate = eval.stats.degree_stats(graph_test, graph_validate) | |||
| mmd_clustering_validate = eval.stats.clustering_stats(graph_test, graph_validate) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_test, graph_validate) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| f.write(str(-1) + ',' + str(-1) + ',' + str(mmd_degree_validate) + ',' + str( | |||
| mmd_clustering_validate) + ',' + str(mmd_4orbits_validate) | |||
| + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) + '\n') | |||
| # get MMD between ground truth and its perturbed graphs | |||
| if model_name == 'Noise': | |||
| graph_validate_perturbed = perturb(graph_validate, 0.05) | |||
| mmd_degree_validate = eval.stats.degree_stats(graph_test, graph_validate_perturbed) | |||
| mmd_clustering_validate = eval.stats.clustering_stats(graph_test, graph_validate_perturbed) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_test, graph_validate_perturbed) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| f.write(str(-1) + ',' + str(-1) + ',' + str(mmd_degree_validate) + ',' + str( | |||
| mmd_clustering_validate) + ',' + str(mmd_4orbits_validate) | |||
| + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) + '\n') | |||
| # get E-R MMD | |||
| if model_name == 'E-R': | |||
| graph_pred = Graph_generator_baseline(graph_train,generator='Gnp') | |||
| # clean graphs | |||
| if is_clean: | |||
| graph_test, graph_pred = clean_graphs(graph_test, graph_pred) | |||
| print('len graph_test', len(graph_test)) | |||
| print('len graph_pred', len(graph_pred)) | |||
| mmd_degree = eval.stats.degree_stats(graph_test, graph_pred) | |||
| mmd_clustering = eval.stats.clustering_stats(graph_test, graph_pred) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_test, graph_pred) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| f.write(str(-1) + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) | |||
| + ',' + str(mmd_degree) + ',' + str(mmd_clustering) + ',' + str(mmd_4orbits_validate) + '\n') | |||
| # get B-A MMD | |||
| if model_name == 'B-A': | |||
| graph_pred = Graph_generator_baseline(graph_train, generator='BA') | |||
| # clean graphs | |||
| if is_clean: | |||
| graph_test, graph_pred = clean_graphs(graph_test, graph_pred) | |||
| print('len graph_test', len(graph_test)) | |||
| print('len graph_pred', len(graph_pred)) | |||
| mmd_degree = eval.stats.degree_stats(graph_test, graph_pred) | |||
| mmd_clustering = eval.stats.clustering_stats(graph_test, graph_pred) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_test, graph_pred) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| f.write(str(-1) + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) + ',' + str(-1) | |||
| + ',' + str(mmd_degree) + ',' + str(mmd_clustering) + ',' + str(mmd_4orbits_validate) + '\n') | |||
| # get performance for baseline approaches | |||
| if 'Baseline' in model_name: | |||
| # read test graph | |||
| for epoch in range(epoch_start, epoch_end, epoch_step): | |||
| # get filename | |||
| fname_pred = dir_input + model_name + '_' + dataset_name + '_' + str( | |||
| 64) + '_pred_' + str(epoch) + '.dat' | |||
| # load graphs | |||
| try: | |||
| graph_pred = utils.load_graph_list(fname_pred, is_real=True) # default False | |||
| except: | |||
| print('Not found: ' + fname_pred) | |||
| logging.warning('Not found: ' + fname_pred) | |||
| continue | |||
| # clean graphs | |||
| if is_clean: | |||
| graph_test, graph_pred = clean_graphs(graph_test, graph_pred) | |||
| else: | |||
| shuffle(graph_pred) | |||
| graph_pred = graph_pred[0:len(graph_test)] | |||
| print('len graph_test', len(graph_test)) | |||
| print('len graph_validate', len(graph_validate)) | |||
| print('len graph_pred', len(graph_pred)) | |||
| graph_pred_aver = 0 | |||
| for graph in graph_pred: | |||
| graph_pred_aver += graph.number_of_nodes() | |||
| graph_pred_aver /= len(graph_pred) | |||
| print('pred average len', graph_pred_aver) | |||
| # evaluate MMD test | |||
| mmd_degree = eval.stats.degree_stats(graph_test, graph_pred) | |||
| mmd_clustering = eval.stats.clustering_stats(graph_test, graph_pred) | |||
| try: | |||
| mmd_4orbits = eval.stats.orbit_stats_all(graph_test, graph_pred) | |||
| except: | |||
| mmd_4orbits = -1 | |||
| # evaluate MMD validate | |||
| mmd_degree_validate = eval.stats.degree_stats(graph_validate, graph_pred) | |||
| mmd_clustering_validate = eval.stats.clustering_stats(graph_validate, graph_pred) | |||
| try: | |||
| mmd_4orbits_validate = eval.stats.orbit_stats_all(graph_validate, graph_pred) | |||
| except: | |||
| mmd_4orbits_validate = -1 | |||
| # write results | |||
| f.write(str(-1) + ',' + str(epoch) + ',' + str(mmd_degree_validate) + ',' + str( | |||
| mmd_clustering_validate) + ',' + str(mmd_4orbits_validate) | |||
| + ',' + str(mmd_degree) + ',' + str(mmd_clustering) + ',' + str(mmd_4orbits) + '\n') | |||
| print('degree', mmd_degree, 'clustering', mmd_clustering, 'orbits', mmd_4orbits) | |||
| return True | |||
| def evaluation(args_evaluate,dir_input, dir_output, model_name_all, dataset_name_all, args, overwrite = True): | |||
| ''' Evaluate the performance of a set of models on a set of datasets. | |||
| ''' | |||
| for model_name in model_name_all: | |||
| for dataset_name in dataset_name_all: | |||
| # check output exist | |||
| fname_output = dir_output+model_name+'_'+dataset_name+'.csv' | |||
| print('processing: '+dir_output + model_name + '_' + dataset_name + '.csv') | |||
| logging.info('processing: '+dir_output + model_name + '_' + dataset_name + '.csv') | |||
| if overwrite==False and os.path.isfile(fname_output): | |||
| print(dir_output+model_name+'_'+dataset_name+'.csv exists!') | |||
| logging.info(dir_output+model_name+'_'+dataset_name+'.csv exists!') | |||
| continue | |||
| evaluation_epoch(dir_input,fname_output,model_name,dataset_name,args,is_clean=True, epoch_start=args_evaluate.epoch_start,epoch_end=args_evaluate.epoch_end,epoch_step=args_evaluate.epoch_step) | |||
| def eval_list_fname(real_graph_filename, pred_graphs_filename, baselines, | |||
| eval_every, epoch_range=None, out_file_prefix=None): | |||
| ''' Evaluate list of predicted graphs compared to ground truth, stored in files. | |||
| Args: | |||
| baselines: dict mapping name of the baseline to list of generated graphs. | |||
| ''' | |||
| if out_file_prefix is not None: | |||
| out_files = { | |||
| 'train': open(out_file_prefix + '_train.txt', 'w+'), | |||
| 'compare': open(out_file_prefix + '_compare.txt', 'w+') | |||
| } | |||
| out_files['train'].write('degree,clustering,orbits4\n') | |||
| line = 'metric,real,ours,perturbed' | |||
| for bl in baselines: | |||
| line += ',' + bl | |||
| line += '\n' | |||
| out_files['compare'].write(line) | |||
| results = { | |||
| 'deg': { | |||
| 'real': 0, | |||
| 'ours': 100, # take min over all training epochs | |||
| 'perturbed': 0, | |||
| 'kron': 0}, | |||
| 'clustering': { | |||
| 'real': 0, | |||
| 'ours': 100, | |||
| 'perturbed': 0, | |||
| 'kron': 0}, | |||
| 'orbits4': { | |||
| 'real': 0, | |||
| 'ours': 100, | |||
| 'perturbed': 0, | |||
| 'kron': 0} | |||
| } | |||
| num_evals = len(pred_graphs_filename) | |||
| if epoch_range is None: | |||
| epoch_range = [i * eval_every for i in range(num_evals)] | |||
| for i in range(num_evals): | |||
| real_g_list = utils.load_graph_list(real_graph_filename) | |||
| #pred_g_list = utils.load_graph_list(pred_graphs_filename[i]) | |||
| # contains all predicted G | |||
| pred_g_list_raw = utils.load_graph_list(pred_graphs_filename[i]) | |||
| if len(real_g_list)>200: | |||
| real_g_list = real_g_list[0:200] | |||
| shuffle(real_g_list) | |||
| shuffle(pred_g_list_raw) | |||
| # get length | |||
| real_g_len_list = np.array([len(real_g_list[i]) for i in range(len(real_g_list))]) | |||
| pred_g_len_list_raw = np.array([len(pred_g_list_raw[i]) for i in range(len(pred_g_list_raw))]) | |||
| # get perturb real | |||
| #perturbed_g_list_001 = perturb(real_g_list, 0.01) | |||
| perturbed_g_list_005 = perturb(real_g_list, 0.05) | |||
| #perturbed_g_list_010 = perturb(real_g_list, 0.10) | |||
| # select pred samples | |||
# The number of nodes is sampled from a distribution similar to that of the training set
| pred_g_list = [] | |||
| pred_g_len_list = [] | |||
| for value in real_g_len_list: | |||
| pred_idx = find_nearest_idx(pred_g_len_list_raw, value) | |||
| pred_g_list.append(pred_g_list_raw[pred_idx]) | |||
| pred_g_len_list.append(pred_g_len_list_raw[pred_idx]) | |||
| # delete | |||
| pred_g_len_list_raw = np.delete(pred_g_len_list_raw, pred_idx) | |||
| del pred_g_list_raw[pred_idx] | |||
| if len(pred_g_list) == len(real_g_list): | |||
| break | |||
| # pred_g_len_list = np.array(pred_g_len_list) | |||
| print('################## epoch {} ##################'.format(epoch_range[i])) | |||
| # info about graph size | |||
| print('real average nodes', | |||
| sum([real_g_list[i].number_of_nodes() for i in range(len(real_g_list))]) / len(real_g_list)) | |||
| print('pred average nodes', | |||
| sum([pred_g_list[i].number_of_nodes() for i in range(len(pred_g_list))]) / len(pred_g_list)) | |||
| print('num of real graphs', len(real_g_list)) | |||
| print('num of pred graphs', len(pred_g_list)) | |||
| # ======================================== | |||
| # Evaluation | |||
| # ======================================== | |||
| mid = len(real_g_list) // 2 | |||
| dist_degree, dist_clustering = compute_basic_stats(real_g_list[:mid], real_g_list[mid:]) | |||
| #dist_4cycle = eval.stats.motif_stats(real_g_list[:mid], real_g_list[mid:]) | |||
| dist_4orbits = eval.stats.orbit_stats_all(real_g_list[:mid], real_g_list[mid:]) | |||
| print('degree dist among real: ', dist_degree) | |||
| print('clustering dist among real: ', dist_clustering) | |||
| #print('4 cycle dist among real: ', dist_4cycle) | |||
| print('orbits dist among real: ', dist_4orbits) | |||
| results['deg']['real'] += dist_degree | |||
| results['clustering']['real'] += dist_clustering | |||
| results['orbits4']['real'] += dist_4orbits | |||
| dist_degree, dist_clustering = compute_basic_stats(real_g_list, pred_g_list) | |||
| #dist_4cycle = eval.stats.motif_stats(real_g_list, pred_g_list) | |||
| dist_4orbits = eval.stats.orbit_stats_all(real_g_list, pred_g_list) | |||
| print('degree dist between real and pred at epoch ', epoch_range[i], ': ', dist_degree) | |||
| print('clustering dist between real and pred at epoch ', epoch_range[i], ': ', dist_clustering) | |||
| #print('4 cycle dist between real and pred at epoch: ', epoch_range[i], dist_4cycle) | |||
| print('orbits dist between real and pred at epoch ', epoch_range[i], ': ', dist_4orbits) | |||
| results['deg']['ours'] = min(dist_degree, results['deg']['ours']) | |||
| results['clustering']['ours'] = min(dist_clustering, results['clustering']['ours']) | |||
| results['orbits4']['ours'] = min(dist_4orbits, results['orbits4']['ours']) | |||
| # performance at training time | |||
| out_files['train'].write(str(dist_degree) + ',') | |||
| out_files['train'].write(str(dist_clustering) + ',') | |||
| out_files['train'].write(str(dist_4orbits) + ',') | |||
| dist_degree, dist_clustering = compute_basic_stats(real_g_list, perturbed_g_list_005) | |||
| #dist_4cycle = eval.stats.motif_stats(real_g_list, perturbed_g_list_005) | |||
| dist_4orbits = eval.stats.orbit_stats_all(real_g_list, perturbed_g_list_005) | |||
| print('degree dist between real and perturbed at epoch ', epoch_range[i], ': ', dist_degree) | |||
| print('clustering dist between real and perturbed at epoch ', epoch_range[i], ': ', dist_clustering) | |||
| #print('4 cycle dist between real and perturbed at epoch: ', epoch_range[i], dist_4cycle) | |||
| print('orbits dist between real and perturbed at epoch ', epoch_range[i], ': ', dist_4orbits) | |||
| results['deg']['perturbed'] += dist_degree | |||
| results['clustering']['perturbed'] += dist_clustering | |||
| results['orbits4']['perturbed'] += dist_4orbits | |||
| if i == 0: | |||
| # Baselines | |||
| for baseline in baselines: | |||
| dist_degree, dist_clustering = compute_basic_stats(real_g_list, baselines[baseline]) | |||
| dist_4orbits = eval.stats.orbit_stats_all(real_g_list, baselines[baseline]) | |||
| results['deg'][baseline] = dist_degree | |||
| results['clustering'][baseline] = dist_clustering | |||
| results['orbits4'][baseline] = dist_4orbits | |||
print(baseline + ': deg=', dist_degree, ', clustering=', dist_clustering,
', orbits4=', dist_4orbits)
| out_files['train'].write('\n') | |||
| for metric, methods in results.items(): | |||
| methods['real'] /= num_evals | |||
| methods['perturbed'] /= num_evals | |||
| # Write results | |||
| for metric, methods in results.items(): | |||
| line = metric+','+ \ | |||
| str(methods['real'])+','+ \ | |||
| str(methods['ours'])+','+ \ | |||
| str(methods['perturbed']) | |||
| for baseline in baselines: | |||
| line += ',' + str(methods[baseline]) | |||
| line += '\n' | |||
| out_files['compare'].write(line) | |||
| for _, out_f in out_files.items(): | |||
| out_f.close() | |||
| def eval_performance(datadir, prefix=None, args=None, eval_every=200, out_file_prefix=None, | |||
| sample_time = 2, baselines={}): | |||
| if args is None: | |||
real_graphs_filename = [datadir + f for f in os.listdir(datadir)
if re.match(prefix + r'.*real.*\.dat', f)]
pred_graphs_filename = [datadir + f for f in os.listdir(datadir)
if re.match(prefix + r'.*pred.*\.dat', f)]
| eval_list(real_graphs_filename, pred_graphs_filename, prefix, 200) | |||
| else: | |||
| # # for vanilla graphrnn | |||
| # real_graphs_filename = [datadir + args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_pred_' + str(args.num_layers) + '_' + str(args.bptt) + '_' + str(args.bptt_len) + '.dat' for epoch in range(0,50001,eval_every)] | |||
| # pred_graphs_filename = [datadir + args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_real_' + str(args.num_layers) + '_' + str(args.bptt) + '_' + str(args.bptt_len) + '.dat' for epoch in range(0,50001,eval_every)] | |||
| real_graph_filename = datadir+args.graph_save_path + args.fname_test + '0.dat' | |||
| # for proposed model | |||
| end_epoch = 3001 | |||
| epoch_range = range(eval_every, end_epoch, eval_every) | |||
| pred_graphs_filename = [datadir+args.graph_save_path + args.fname_pred+str(epoch)+'_'+str(sample_time)+'.dat' | |||
| for epoch in epoch_range] | |||
| # for baseline model | |||
| #pred_graphs_filename = [datadir+args.fname_baseline+'.dat'] | |||
| #real_graphs_filename = [datadir + args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_real_' + str(args.num_layers) + '_' + str(args.bptt) + '_' + str( | |||
| # args.bptt_len) + '_' + str(args.gumbel) + '.dat' for epoch in range(10000, 50001, eval_every)] | |||
| #pred_graphs_filename = [datadir + args.graph_save_path + args.note + '_' + args.graph_type + '_' + \ | |||
| # str(epoch) + '_pred_' + str(args.num_layers) + '_' + str(args.bptt) + '_' + str( | |||
| # args.bptt_len) + '_' + str(args.gumbel) + '.dat' for epoch in range(10000, 50001, eval_every)] | |||
| eval_list_fname(real_graph_filename, pred_graphs_filename, baselines, | |||
| epoch_range=epoch_range, | |||
| eval_every=eval_every, | |||
| out_file_prefix=out_file_prefix) | |||
| def process_kron(kron_dir): | |||
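# Collect graphs produced by the SNAP krongen baseline: if a pre-converted .dat file is found it
# is loaded directly, otherwise every .txt edge list in the directory is converted to networkx.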
| txt_files = [] | |||
| for f in os.listdir(kron_dir): | |||
| filename = os.fsdecode(f) | |||
| if filename.endswith('.txt'): | |||
| txt_files.append(filename) | |||
| elif filename.endswith('.dat'): | |||
| return utils.load_graph_list(os.path.join(kron_dir, filename)) | |||
| G_list = [] | |||
| for filename in txt_files: | |||
| G_list.append(utils.snap_txt_output_to_nx(os.path.join(kron_dir, filename))) | |||
| return G_list | |||
| if __name__ == '__main__': | |||
| args = Args() | |||
| args_evaluate = Args_evaluate() | |||
| parser = argparse.ArgumentParser(description='Evaluation arguments.') | |||
| feature_parser = parser.add_mutually_exclusive_group(required=False) | |||
| feature_parser.add_argument('--export-real', dest='export', action='store_true') | |||
| feature_parser.add_argument('--no-export-real', dest='export', action='store_false') | |||
| feature_parser.add_argument('--kron-dir', dest='kron_dir', | |||
help='Directory where graphs generated by the Kronecker method are stored.')
| parser.add_argument('--testfile', dest='test_file', | |||
| help='The file that stores list of graphs to be evaluated. Only used when 1 list of ' | |||
| 'graphs is to be evaluated.') | |||
parser.add_argument('--dir-prefix', dest='dir_prefix',
help='Directory prefix where input graphs and evaluation results are stored. Can be used when '
'evaluating multiple models on multiple datasets.')
| parser.add_argument('--graph-type', dest='graph_type', | |||
| help='Type of graphs / dataset.') | |||
| parser.set_defaults(export=False, kron_dir='', test_file='', | |||
| dir_prefix='', | |||
| graph_type=args.graph_type) | |||
| prog_args = parser.parse_args() | |||
| # dir_prefix = prog_args.dir_prefix | |||
| # dir_prefix = "/dfs/scratch0/jiaxuany0/" | |||
| dir_prefix = args.dir_input | |||
| time_now = strftime("%Y-%m-%d %H:%M:%S", gmtime()) | |||
| if not os.path.isdir('logs/'): | |||
| os.makedirs('logs/') | |||
| logging.basicConfig(filename='logs/evaluate' + time_now + '.log', level=logging.INFO) | |||
| if prog_args.export: | |||
| if not os.path.isdir('eval_results'): | |||
| os.makedirs('eval_results') | |||
| if not os.path.isdir('eval_results/ground_truth'): | |||
| os.makedirs('eval_results/ground_truth') | |||
| out_dir = os.path.join('eval_results/ground_truth', prog_args.graph_type) | |||
| if not os.path.isdir(out_dir): | |||
| os.makedirs(out_dir) | |||
| output_prefix = os.path.join(out_dir, prog_args.graph_type) | |||
| print('Export ground truth to prefix: ', output_prefix) | |||
| if prog_args.graph_type == 'grid': | |||
| graphs = [] | |||
| for i in range(10,20): | |||
| for j in range(10,20): | |||
| graphs.append(nx.grid_2d_graph(i,j)) | |||
| utils.export_graphs_to_txt(graphs, output_prefix) | |||
| elif prog_args.graph_type == 'caveman': | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(30, 81): | |||
| for k in range(10): | |||
| graphs.append(caveman_special(i,j, p_edge=0.3)) | |||
| utils.export_graphs_to_txt(graphs, output_prefix) | |||
| elif prog_args.graph_type == 'citeseer': | |||
| graphs = utils.citeseer_ego() | |||
| utils.export_graphs_to_txt(graphs, output_prefix) | |||
| else: | |||
| # load from directory | |||
| input_path = dir_prefix + real_graph_filename | |||
| g_list = utils.load_graph_list(input_path) | |||
| utils.export_graphs_to_txt(g_list, output_prefix) | |||
| elif not prog_args.kron_dir == '': | |||
| kron_g_list = process_kron(prog_args.kron_dir) | |||
| fname = os.path.join(prog_args.kron_dir, prog_args.graph_type + '.dat') | |||
| print([g.number_of_nodes() for g in kron_g_list]) | |||
| utils.save_graph_list(kron_g_list, fname) | |||
| elif not prog_args.test_file == '': | |||
| # evaluate single .dat file containing list of test graphs (networkx format) | |||
| graphs = utils.load_graph_list(prog_args.test_file) | |||
| eval_single_list(graphs, dir_input=dir_prefix+'graphs/', dataset_name='grid') | |||
## if you are not evaluating the Kronecker baseline, only the following branch is needed
| else: | |||
| if not os.path.isdir(dir_prefix+'eval_results'): | |||
| os.makedirs(dir_prefix+'eval_results') | |||
| evaluation(args_evaluate,dir_input=dir_prefix+"graphs/", dir_output=dir_prefix+"eval_results/", | |||
| model_name_all=args_evaluate.model_name_all,dataset_name_all=args_evaluate.dataset_name_all,args=args,overwrite=True) | |||
| @@ -0,0 +1,141 @@ | |||
| from train import * | |||
| if __name__ == '__main__': | |||
| # All necessary arguments are defined in args.py | |||
| args = Args() | |||
| os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda) | |||
| print('CUDA', args.cuda) | |||
| print('File name prefix',args.fname) | |||
| # check if necessary directories exist | |||
| if not os.path.isdir(args.model_save_path): | |||
| os.makedirs(args.model_save_path) | |||
| if not os.path.isdir(args.graph_save_path): | |||
| os.makedirs(args.graph_save_path) | |||
| if not os.path.isdir(args.figure_save_path): | |||
| os.makedirs(args.figure_save_path) | |||
| if not os.path.isdir(args.timing_save_path): | |||
| os.makedirs(args.timing_save_path) | |||
| if not os.path.isdir(args.figure_prediction_save_path): | |||
| os.makedirs(args.figure_prediction_save_path) | |||
| if not os.path.isdir(args.nll_save_path): | |||
| os.makedirs(args.nll_save_path) | |||
| time = strftime("%Y-%m-%d %H:%M:%S", gmtime()) | |||
| # logging.basicConfig(filename='logs/train' + time + '.log', level=logging.DEBUG) | |||
| if args.clean_tensorboard: | |||
| if os.path.isdir("tensorboard"): | |||
| shutil.rmtree("tensorboard") | |||
| configure("tensorboard/run"+time, flush_secs=5) | |||
| graphs = create_graphs.create(args) | |||
| # split datasets | |||
| random.seed(123) | |||
| shuffle(graphs) | |||
| graphs_len = len(graphs) | |||
| graphs_test = graphs[int(0.8 * graphs_len):] | |||
| graphs_train = graphs[0:int(0.8*graphs_len)] | |||
| graphs_validate = graphs[0:int(0.2*graphs_len)] | |||
# if using pre-saved graphs
| # dir_input = "/dfs/scratch0/jiaxuany0/graphs/" | |||
| # fname_test = dir_input + args.note + '_' + args.graph_type + '_' + str(args.num_layers) + '_' + str( | |||
| # args.hidden_size_rnn) + '_test_' + str(0) + '.dat' | |||
| # graphs = load_graph_list(fname_test, is_real=True) | |||
| # graphs_test = graphs[int(0.8 * graphs_len):] | |||
| # graphs_train = graphs[0:int(0.8 * graphs_len)] | |||
| # graphs_validate = graphs[int(0.2 * graphs_len):int(0.4 * graphs_len)] | |||
| graph_validate_len = 0 | |||
| for graph in graphs_validate: | |||
| graph_validate_len += graph.number_of_nodes() | |||
| graph_validate_len /= len(graphs_validate) | |||
| print('graph_validate_len', graph_validate_len) | |||
| graph_test_len = 0 | |||
| for graph in graphs_test: | |||
| graph_test_len += graph.number_of_nodes() | |||
| graph_test_len /= len(graphs_test) | |||
| print('graph_test_len', graph_test_len) | |||
| args.max_num_node = max([graphs[i].number_of_nodes() for i in range(len(graphs))]) | |||
| max_num_edge = max([graphs[i].number_of_edges() for i in range(len(graphs))]) | |||
| min_num_edge = min([graphs[i].number_of_edges() for i in range(len(graphs))]) | |||
| # args.max_num_node = 2000 | |||
| # show graphs statistics | |||
| print('total graph num: {}, training set: {}'.format(len(graphs),len(graphs_train))) | |||
| print('max number node: {}'.format(args.max_num_node)) | |||
| print('max/min number edge: {}; {}'.format(max_num_edge,min_num_edge)) | |||
| print('max previous node: {}'.format(args.max_prev_node)) | |||
| # save ground truth graphs | |||
## To get the train and test sets, you need to slice the loaded list manually
| save_graph_list(graphs, args.graph_save_path + args.fname_train + '0.dat') | |||
| save_graph_list(graphs, args.graph_save_path + args.fname_test + '0.dat') | |||
| print('train and test graphs saved at: ', args.graph_save_path + args.fname_test + '0.dat') | |||
| ### comment when normal training, for graph completion only | |||
| # p = 0.5 | |||
| # for graph in graphs_train: | |||
| # for node in list(graph.nodes()): | |||
| # # print('node',node) | |||
| # if np.random.rand()>p: | |||
| # graph.remove_node(node) | |||
| # for edge in list(graph.edges()): | |||
| # # print('edge',edge) | |||
| # if np.random.rand()>p: | |||
| # graph.remove_edge(edge[0],edge[1]) | |||
| ### dataset initialization | |||
| if 'nobfs' in args.note: | |||
| print('nobfs') | |||
| dataset = Graph_sequence_sampler_pytorch_nobfs(graphs_train, max_num_node=args.max_num_node) | |||
| args.max_prev_node = args.max_num_node-1 | |||
| if 'barabasi_noise' in args.graph_type: | |||
| print('barabasi_noise') | |||
| dataset = Graph_sequence_sampler_pytorch_canonical(graphs_train,max_prev_node=args.max_prev_node) | |||
| args.max_prev_node = args.max_num_node - 1 | |||
| else: | |||
| dataset = Graph_sequence_sampler_pytorch(graphs_train,max_prev_node=args.max_prev_node,max_num_node=args.max_num_node) | |||
| sample_strategy = torch.utils.data.sampler.WeightedRandomSampler([1.0 / len(dataset) for i in range(len(dataset))], | |||
| num_samples=args.batch_size*args.batch_ratio, replacement=True) | |||
| dataset_loader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, num_workers=args.num_workers, | |||
| sampler=sample_strategy) | |||
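# With the weighted random sampler above, one "epoch" corresponds to batch_ratio minibatches of
# size batch_size drawn with replacement, rather than a full pass over the training graphs.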
| ### model initialization | |||
| ## Graph RNN VAE model | |||
| # lstm = LSTM_plain(input_size=args.max_prev_node, embedding_size=args.embedding_size_lstm, | |||
| # hidden_size=args.hidden_size, num_layers=args.num_layers).cuda() | |||
| if 'GraphRNN_VAE_conditional' in args.note: | |||
| rnn = GRU_plain(input_size=args.max_prev_node, embedding_size=args.embedding_size_rnn, | |||
| hidden_size=args.hidden_size_rnn, num_layers=args.num_layers, has_input=True, | |||
| has_output=False).cuda() | |||
| output = MLP_VAE_conditional_plain(h_size=args.hidden_size_rnn, embedding_size=args.embedding_size_output, y_size=args.max_prev_node).cuda() | |||
| elif 'GraphRNN_MLP' in args.note: | |||
| rnn = GRU_plain(input_size=args.max_prev_node, embedding_size=args.embedding_size_rnn, | |||
| hidden_size=args.hidden_size_rnn, num_layers=args.num_layers, has_input=True, | |||
| has_output=False).cuda() | |||
| output = MLP_plain(h_size=args.hidden_size_rnn, embedding_size=args.embedding_size_output, y_size=args.max_prev_node).cuda() | |||
| elif 'GraphRNN_RNN' in args.note: | |||
| rnn = GRU_plain(input_size=args.max_prev_node, embedding_size=args.embedding_size_rnn, | |||
| hidden_size=args.hidden_size_rnn, num_layers=args.num_layers, has_input=True, | |||
| has_output=True, output_size=args.hidden_size_rnn_output).cuda() | |||
| output = GRU_plain(input_size=1, embedding_size=args.embedding_size_rnn_output, | |||
| hidden_size=args.hidden_size_rnn_output, num_layers=args.num_layers, has_input=True, | |||
| has_output=True, output_size=1).cuda() | |||
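# For the GraphRNN_RNN variant, the node-level GRU ('rnn') produces an output of size
# hidden_size_rnn_output, which initializes the edge-level GRU ('output'); the edge-level GRU then
# predicts one entry of the new node's adjacency vector per step (input_size=1, output_size=1).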
| ### start training | |||
| train(args, dataset_loader, rnn, output) | |||
| ### graph completion | |||
| # train_graph_completion(args,dataset_loader,rnn,output) | |||
| ### nll evaluation | |||
| # train_nll(args, dataset_loader, dataset_loader, rnn, output, max_iter = 200, graph_validate_len=graph_validate_len,graph_test_len=graph_test_len) | |||
| @@ -0,0 +1,594 @@ | |||
# An implementation of "Learning Deep Generative Models of Graphs" (DeepGMG)
| from main import * | |||
| class Args_DGMG(): | |||
| def __init__(self): | |||
| ### CUDA | |||
| self.cuda = 2 | |||
| ### model type | |||
| self.note = 'Baseline_DGMG' # do GCN after adding each edge | |||
| # self.note = 'Baseline_DGMG_fast' # do GCN only after adding each node | |||
| ### data config | |||
| self.graph_type = 'caveman_small' | |||
| # self.graph_type = 'grid_small' | |||
| # self.graph_type = 'ladder_small' | |||
| # self.graph_type = 'enzymes_small' | |||
| # self.graph_type = 'barabasi_small' | |||
| # self.graph_type = 'citeseer_small' | |||
| self.max_num_node = 20 | |||
| ### network config | |||
| self.node_embedding_size = 64 | |||
| self.test_graph_num = 200 | |||
| ### training config | |||
self.epochs = 2000 # one epoch is a single pass over the training graphs
| self.load_epoch = 2000 | |||
| self.epochs_test_start = 100 | |||
| self.epochs_test = 100 | |||
| self.epochs_log = 100 | |||
| self.epochs_save = 100 | |||
| if 'fast' in self.note: | |||
| self.is_fast = True | |||
| else: | |||
| self.is_fast = False | |||
| self.lr = 0.001 | |||
| self.milestones = [300, 600, 1000] | |||
| self.lr_rate = 0.3 | |||
| ### output config | |||
| self.model_save_path = 'model_save/' | |||
| self.graph_save_path = 'graphs/' | |||
| self.figure_save_path = 'figures/' | |||
| self.timing_save_path = 'timing/' | |||
| self.figure_prediction_save_path = 'figures_prediction/' | |||
| self.nll_save_path = 'nll/' | |||
| self.fname = self.note + '_' + self.graph_type + '_' + str(self.node_embedding_size) | |||
| self.fname_pred = self.note + '_' + self.graph_type + '_' + str(self.node_embedding_size) + '_pred_' | |||
| self.fname_train = self.note + '_' + self.graph_type + '_' + str(self.node_embedding_size) + '_train_' | |||
| self.fname_test = self.note + '_' + self.graph_type + '_' + str(self.node_embedding_size) + '_test_' | |||
| self.load = False | |||
| self.save = True | |||
| def train_DGMG_epoch(epoch, args, model, dataset, optimizer, scheduler, is_fast = False): | |||
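# One training pass over the dataset. For every graph (under a random node ordering) the model is
# unrolled node by node: f_an decides whether to add a node, f_ae decides whether to add an edge,
# and f_s scores existing nodes to pick the new edge's endpoint; the losses of all decisions are
# summed before a single backward/optimizer step per graph.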
| model.train() | |||
| graph_num = len(dataset) | |||
| order = list(range(graph_num)) | |||
| shuffle(order) | |||
| loss_addnode = 0 | |||
| loss_addedge = 0 | |||
| loss_node = 0 | |||
| for i in order: | |||
| model.zero_grad() | |||
| graph = dataset[i] | |||
| # do random ordering: relabel nodes | |||
| node_order = list(range(graph.number_of_nodes())) | |||
| shuffle(node_order) | |||
| order_mapping = dict(zip(graph.nodes(), node_order)) | |||
| graph = nx.relabel_nodes(graph, order_mapping, copy=True) | |||
| # NOTE: when starting loop, we assume a node has already been generated | |||
| node_count = 1 | |||
| node_embedding = [Variable(torch.ones(1,args.node_embedding_size)).cuda()] # list of torch tensors, each size: 1*hidden | |||
| loss = 0 | |||
| while node_count<=graph.number_of_nodes(): | |||
| node_neighbor = graph.subgraph(list(range(node_count))).adjacency_list() # list of lists (first node is zero) | |||
| node_neighbor_new = graph.subgraph(list(range(node_count+1))).adjacency_list()[-1] # list of new node's neighbors | |||
| # 1 message passing | |||
| # do 2 times message passing | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| # 2 graph embedding and new node embedding | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| init_embedding = calc_init_embedding(node_embedding_cat, model) | |||
| # 3 f_addnode | |||
| p_addnode = model.f_an(graph_embedding) | |||
| if node_count < graph.number_of_nodes(): | |||
| # add node | |||
| node_neighbor.append([]) | |||
| node_embedding.append(init_embedding) | |||
| if is_fast: | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| # calc loss | |||
| loss_addnode_step = F.binary_cross_entropy(p_addnode,Variable(torch.ones((1,1))).cuda()) | |||
| # loss_addnode_step.backward(retain_graph=True) | |||
| loss += loss_addnode_step | |||
| loss_addnode += loss_addnode_step.data | |||
| else: | |||
| # calc loss | |||
| loss_addnode_step = F.binary_cross_entropy(p_addnode, Variable(torch.zeros((1, 1))).cuda()) | |||
| # loss_addnode_step.backward(retain_graph=True) | |||
| loss += loss_addnode_step | |||
| loss_addnode += loss_addnode_step.data | |||
| break | |||
| edge_count = 0 | |||
| while edge_count<=len(node_neighbor_new): | |||
| if not is_fast: | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| # 4 f_addedge | |||
| p_addedge = model.f_ae(graph_embedding) | |||
| if edge_count < len(node_neighbor_new): | |||
| # calc loss | |||
| loss_addedge_step = F.binary_cross_entropy(p_addedge, Variable(torch.ones((1, 1))).cuda()) | |||
| # loss_addedge_step.backward(retain_graph=True) | |||
| loss += loss_addedge_step | |||
| loss_addedge += loss_addedge_step.data | |||
| # 5 f_nodes | |||
| # excluding the last node (which is the new node) | |||
| node_new_embedding_cat = node_embedding_cat[-1,:].expand(node_embedding_cat.size(0)-1,node_embedding_cat.size(1)) | |||
| s_node = model.f_s(torch.cat((node_embedding_cat[0:-1,:],node_new_embedding_cat),dim=1)) | |||
| p_node = F.softmax(s_node.permute(1,0)) | |||
| # get ground truth | |||
| a_node = torch.zeros((1,p_node.size(1))) | |||
| # print('node_neighbor_new',node_neighbor_new, edge_count) | |||
| a_node[0,node_neighbor_new[edge_count]] = 1 | |||
| a_node = Variable(a_node).cuda() | |||
| # add edge | |||
| node_neighbor[-1].append(node_neighbor_new[edge_count]) | |||
| node_neighbor[node_neighbor_new[edge_count]].append(len(node_neighbor)-1) | |||
| # calc loss | |||
| loss_node_step = F.binary_cross_entropy(p_node,a_node) | |||
| # loss_node_step.backward(retain_graph=True) | |||
| loss += loss_node_step | |||
| loss_node += loss_node_step.data | |||
| else: | |||
| # calc loss | |||
| loss_addedge_step = F.binary_cross_entropy(p_addedge, Variable(torch.zeros((1, 1))).cuda()) | |||
| # loss_addedge_step.backward(retain_graph=True) | |||
| loss += loss_addedge_step | |||
| loss_addedge += loss_addedge_step.data | |||
| break | |||
| edge_count += 1 | |||
| node_count += 1 | |||
| # update deterministic and lstm | |||
| loss.backward() | |||
| optimizer.step() | |||
| scheduler.step() | |||
| loss_all = loss_addnode + loss_addedge + loss_node | |||
| if epoch % args.epochs_log==0: | |||
| print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss_all[0], args.graph_type, args.node_embedding_size)) | |||
| # loss_sum += loss.data[0]*x.size(0) | |||
| # return loss_sum | |||
| def train_DGMG_forward_epoch(args, model, dataset, is_fast = False): | |||
| model.train() | |||
| graph_num = len(dataset) | |||
| order = list(range(graph_num)) | |||
| shuffle(order) | |||
| loss_addnode = 0 | |||
| loss_addedge = 0 | |||
| loss_node = 0 | |||
| for i in order: | |||
| model.zero_grad() | |||
| graph = dataset[i] | |||
| # do random ordering: relabel nodes | |||
| node_order = list(range(graph.number_of_nodes())) | |||
| shuffle(node_order) | |||
| order_mapping = dict(zip(graph.nodes(), node_order)) | |||
| graph = nx.relabel_nodes(graph, order_mapping, copy=True) | |||
| # NOTE: when starting loop, we assume a node has already been generated | |||
| node_count = 1 | |||
| node_embedding = [Variable(torch.ones(1,args.node_embedding_size)).cuda()] # list of torch tensors, each size: 1*hidden | |||
| loss = 0 | |||
| while node_count<=graph.number_of_nodes(): | |||
| node_neighbor = graph.subgraph(list(range(node_count))).adjacency_list() # list of lists (first node is zero) | |||
| node_neighbor_new = graph.subgraph(list(range(node_count+1))).adjacency_list()[-1] # list of new node's neighbors | |||
| # 1 message passing | |||
| # do 2 times message passing | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| # 2 graph embedding and new node embedding | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| init_embedding = calc_init_embedding(node_embedding_cat, model) | |||
| # 3 f_addnode | |||
| p_addnode = model.f_an(graph_embedding) | |||
| if node_count < graph.number_of_nodes(): | |||
| # add node | |||
| node_neighbor.append([]) | |||
| node_embedding.append(init_embedding) | |||
| if is_fast: | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| # calc loss | |||
| loss_addnode_step = F.binary_cross_entropy(p_addnode,Variable(torch.ones((1,1))).cuda()) | |||
| # loss_addnode_step.backward(retain_graph=True) | |||
| loss += loss_addnode_step | |||
| loss_addnode += loss_addnode_step.data | |||
| else: | |||
| # calc loss | |||
| loss_addnode_step = F.binary_cross_entropy(p_addnode, Variable(torch.zeros((1, 1))).cuda()) | |||
| # loss_addnode_step.backward(retain_graph=True) | |||
| loss += loss_addnode_step | |||
| loss_addnode += loss_addnode_step.data | |||
| break | |||
| edge_count = 0 | |||
| while edge_count<=len(node_neighbor_new): | |||
| if not is_fast: | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| # 4 f_addedge | |||
| p_addedge = model.f_ae(graph_embedding) | |||
| if edge_count < len(node_neighbor_new): | |||
| # calc loss | |||
| loss_addedge_step = F.binary_cross_entropy(p_addedge, Variable(torch.ones((1, 1))).cuda()) | |||
| # loss_addedge_step.backward(retain_graph=True) | |||
| loss += loss_addedge_step | |||
| loss_addedge += loss_addedge_step.data | |||
| # 5 f_nodes | |||
| # excluding the last node (which is the new node) | |||
| node_new_embedding_cat = node_embedding_cat[-1,:].expand(node_embedding_cat.size(0)-1,node_embedding_cat.size(1)) | |||
| s_node = model.f_s(torch.cat((node_embedding_cat[0:-1,:],node_new_embedding_cat),dim=1)) | |||
| p_node = F.softmax(s_node.permute(1,0)) | |||
| # get ground truth | |||
| a_node = torch.zeros((1,p_node.size(1))) | |||
| # print('node_neighbor_new',node_neighbor_new, edge_count) | |||
| a_node[0,node_neighbor_new[edge_count]] = 1 | |||
| a_node = Variable(a_node).cuda() | |||
| # add edge | |||
| node_neighbor[-1].append(node_neighbor_new[edge_count]) | |||
| node_neighbor[node_neighbor_new[edge_count]].append(len(node_neighbor)-1) | |||
| # calc loss | |||
| loss_node_step = F.binary_cross_entropy(p_node,a_node) | |||
| # loss_node_step.backward(retain_graph=True) | |||
| loss += loss_node_step | |||
| loss_node += loss_node_step.data*p_node.size(1) | |||
| else: | |||
| # calc loss | |||
| loss_addedge_step = F.binary_cross_entropy(p_addedge, Variable(torch.zeros((1, 1))).cuda()) | |||
| # loss_addedge_step.backward(retain_graph=True) | |||
| loss += loss_addedge_step | |||
| loss_addedge += loss_addedge_step.data | |||
| break | |||
| edge_count += 1 | |||
| node_count += 1 | |||
| loss_all = loss_addnode + loss_addedge + loss_node | |||
| # if epoch % args.epochs_log==0: | |||
| # print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, hidden: {}'.format( | |||
| # epoch, args.epochs,loss_all[0], args.graph_type, args.node_embedding_size)) | |||
| return loss_all[0]/len(dataset) | |||
| def test_DGMG_epoch(args, model, is_fast=False): | |||
| model.eval() | |||
| graph_num = args.test_graph_num | |||
| graphs_generated = [] | |||
| for i in range(graph_num): | |||
| # NOTE: when starting loop, we assume a node has already been generated | |||
| node_neighbor = [[]] # list of lists (first node is zero) | |||
| node_embedding = [Variable(torch.ones(1,args.node_embedding_size)).cuda()] # list of torch tensors, each size: 1*hidden | |||
| node_count = 1 | |||
| while node_count<=args.max_num_node: | |||
| # 1 message passing | |||
| # do 2 times message passing | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| # 2 graph embedding and new node embedding | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| init_embedding = calc_init_embedding(node_embedding_cat, model) | |||
| # 3 f_addnode | |||
| p_addnode = model.f_an(graph_embedding) | |||
| a_addnode = sample_tensor(p_addnode) | |||
| # print(a_addnode.data[0][0]) | |||
| if a_addnode.data[0][0]==1: | |||
| # print('add node') | |||
| # add node | |||
| node_neighbor.append([]) | |||
| node_embedding.append(init_embedding) | |||
| if is_fast: | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| else: | |||
| break | |||
| edge_count = 0 | |||
| while edge_count<args.max_num_node: | |||
| if not is_fast: | |||
| node_embedding = message_passing(node_neighbor, node_embedding, model) | |||
| node_embedding_cat = torch.cat(node_embedding, dim=0) | |||
| graph_embedding = calc_graph_embedding(node_embedding_cat, model) | |||
| # 4 f_addedge | |||
| p_addedge = model.f_ae(graph_embedding) | |||
| a_addedge = sample_tensor(p_addedge) | |||
| # print(a_addedge.data[0][0]) | |||
| if a_addedge.data[0][0]==1: | |||
| # print('add edge') | |||
| # 5 f_nodes | |||
| # excluding the last node (which is the new node) | |||
| node_new_embedding_cat = node_embedding_cat[-1,:].expand(node_embedding_cat.size(0)-1,node_embedding_cat.size(1)) | |||
| s_node = model.f_s(torch.cat((node_embedding_cat[0:-1,:],node_new_embedding_cat),dim=1)) | |||
| p_node = F.softmax(s_node.permute(1,0)) | |||
| a_node = gumbel_softmax(p_node, temperature=0.01) | |||
| _, a_node_id = a_node.topk(1) | |||
| a_node_id = int(a_node_id.data[0][0]) | |||
| # add edge | |||
| node_neighbor[-1].append(a_node_id) | |||
| node_neighbor[a_node_id].append(len(node_neighbor)-1) | |||
| else: | |||
| break | |||
| edge_count += 1 | |||
| node_count += 1 | |||
| # save graph | |||
| node_neighbor_dict = dict(zip(list(range(len(node_neighbor))), node_neighbor)) | |||
| graph = nx.from_dict_of_lists(node_neighbor_dict) | |||
| graphs_generated.append(graph) | |||
| return graphs_generated | |||
########### train function for DGMG
| def train_DGMG(args, dataset_train, model): | |||
| # check if load existing model | |||
| if args.load: | |||
| fname = args.model_save_path + args.fname + 'model_' + str(args.load_epoch) + '.dat' | |||
| model.load_state_dict(torch.load(fname)) | |||
| args.lr = 0.00001 | |||
| epoch = args.load_epoch | |||
| print('model loaded!, lr: {}'.format(args.lr)) | |||
| else: | |||
| epoch = 1 | |||
| # initialize optimizer | |||
| optimizer = optim.Adam(list(model.parameters()), lr=args.lr) | |||
| scheduler = MultiStepLR(optimizer, milestones=args.milestones, gamma=args.lr_rate) | |||
| # start main loop | |||
| time_all = np.zeros(args.epochs) | |||
| while epoch <= args.epochs: | |||
| time_start = tm.time() | |||
| # train | |||
| train_DGMG_epoch(epoch, args, model, dataset_train, optimizer, scheduler, is_fast=args.is_fast) | |||
| time_end = tm.time() | |||
| time_all[epoch - 1] = time_end - time_start | |||
| # print('time used',time_all[epoch - 1]) | |||
| # test | |||
| if epoch % args.epochs_test == 0 and epoch >= args.epochs_test_start: | |||
| graphs = test_DGMG_epoch(args,model, is_fast=args.is_fast) | |||
| fname = args.graph_save_path + args.fname_pred + str(epoch) + '.dat' | |||
| save_graph_list(graphs, fname) | |||
| # print('test done, graphs saved') | |||
| # save model checkpoint | |||
| if args.save: | |||
| if epoch % args.epochs_save == 0: | |||
| fname = args.model_save_path + args.fname + 'model_' + str(epoch) + '.dat' | |||
| torch.save(model.state_dict(), fname) | |||
| epoch += 1 | |||
| np.save(args.timing_save_path + args.fname, time_all) | |||
########### NLL evaluation for DGMG
| def train_DGMG_nll(args, dataset_train,dataset_test, model,max_iter=1000): | |||
| # check if load existing model | |||
| fname = args.model_save_path + args.fname + 'model_' + str(args.load_epoch) + '.dat' | |||
| model.load_state_dict(torch.load(fname)) | |||
| fname_output = args.nll_save_path + args.note + '_' + args.graph_type + '.csv' | |||
| with open(fname_output, 'w+') as f: | |||
| f.write('train,test\n') | |||
| # start main loop | |||
| for iter in range(max_iter): | |||
| nll_train = train_DGMG_forward_epoch(args, model, dataset_train, is_fast=args.is_fast) | |||
| nll_test = train_DGMG_forward_epoch(args, model, dataset_test, is_fast=args.is_fast) | |||
| print('train', nll_train, 'test', nll_test) | |||
| f.write(str(nll_train) + ',' + str(nll_test) + '\n') | |||
| if __name__ == '__main__': | |||
| args = Args_DGMG() | |||
| os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda) | |||
| print('CUDA', args.cuda) | |||
| print('File name prefix',args.fname) | |||
| graphs = [] | |||
| for i in range(4, 10): | |||
| graphs.append(nx.ladder_graph(i)) | |||
| model = DGM_graphs(h_size = args.node_embedding_size).cuda() | |||
| if args.graph_type == 'ladder_small': | |||
| graphs = [] | |||
| for i in range(2, 11): | |||
| graphs.append(nx.ladder_graph(i)) | |||
| args.max_prev_node = 10 | |||
| # if args.graph_type == 'caveman_small': | |||
| # graphs = [] | |||
| # for i in range(2, 5): | |||
| # for j in range(2, 6): | |||
| # for k in range(10): | |||
| # graphs.append(nx.relaxed_caveman_graph(i, j, p=0.1)) | |||
| # args.max_prev_node = 20 | |||
| if args.graph_type=='caveman_small': | |||
| graphs = [] | |||
| for i in range(2, 3): | |||
| for j in range(6, 11): | |||
| for k in range(20): | |||
| graphs.append(caveman_special(i, j, p_edge=0.8)) | |||
| args.max_prev_node = 20 | |||
| if args.graph_type == 'grid_small': | |||
| graphs = [] | |||
| for i in range(2, 5): | |||
| for j in range(2, 6): | |||
| graphs.append(nx.grid_2d_graph(i, j)) | |||
| args.max_prev_node = 15 | |||
| if args.graph_type == 'barabasi_small': | |||
| graphs = [] | |||
| for i in range(4, 21): | |||
| for j in range(3, 4): | |||
| for k in range(10): | |||
| graphs.append(nx.barabasi_albert_graph(i, j)) | |||
| args.max_prev_node = 20 | |||
| if args.graph_type == 'enzymes_small': | |||
| graphs_raw = Graph_load_batch(min_num_nodes=10, name='ENZYMES') | |||
| graphs = [] | |||
| for G in graphs_raw: | |||
| if G.number_of_nodes()<=20: | |||
| graphs.append(G) | |||
| args.max_prev_node = 15 | |||
| if args.graph_type == 'citeseer_small': | |||
| _, _, G = Graph_load(dataset='citeseer') | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| G = nx.convert_node_labels_to_integers(G) | |||
| graphs = [] | |||
| for i in range(G.number_of_nodes()): | |||
| G_ego = nx.ego_graph(G, i, radius=1) | |||
| if (G_ego.number_of_nodes() >= 4) and (G_ego.number_of_nodes() <= 20): | |||
| graphs.append(G_ego) | |||
| shuffle(graphs) | |||
| graphs = graphs[0:200] | |||
| args.max_prev_node = 15 | |||
| # remove self loops | |||
| for graph in graphs: | |||
| edges_with_selfloops = graph.selfloop_edges() | |||
| if len(edges_with_selfloops) > 0: | |||
| graph.remove_edges_from(edges_with_selfloops) | |||
| # split datasets | |||
| random.seed(123) | |||
| shuffle(graphs) | |||
| graphs_len = len(graphs) | |||
| graphs_test = graphs[int(0.8 * graphs_len):] | |||
| graphs_train = graphs[0:int(0.8 * graphs_len)] | |||
| args.max_num_node = max([graphs[i].number_of_nodes() for i in range(len(graphs))]) | |||
| # args.max_num_node = 2000 | |||
| # show graphs statistics | |||
| print('total graph num: {}, training set: {}'.format(len(graphs), len(graphs_train))) | |||
| print('max number node: {}'.format(args.max_num_node)) | |||
| print('max previous node: {}'.format(args.max_prev_node)) | |||
| # save ground truth graphs | |||
| # save_graph_list(graphs, args.graph_save_path + args.fname_train + '0.dat') | |||
| # save_graph_list(graphs, args.graph_save_path + args.fname_test + '0.dat') | |||
| # print('train and test graphs saved') | |||
| ## if use pre-saved graphs | |||
| # dir_input = "graphs/" | |||
| # fname_test = args.graph_save_path + args.fname_test + '0.dat' | |||
| # graphs = load_graph_list(fname_test, is_real=True) | |||
| # graphs_test = graphs[int(0.8 * graphs_len):] | |||
| # graphs_train = graphs[0:int(0.8 * graphs_len)] | |||
| # graphs_validate = graphs[0:int(0.2 * graphs_len)] | |||
| # print('train') | |||
| # for graph in graphs_validate: | |||
| # print(graph.number_of_nodes()) | |||
| # print('test') | |||
| # for graph in graphs_test: | |||
| # print(graph.number_of_nodes()) | |||
| ### train | |||
| train_DGMG(args,graphs,model) | |||
| ### calc nll | |||
| # train_DGMG_nll(args, graphs_validate,graphs_test, model,max_iter=1000) | |||
| # for j in range(1000): | |||
| # graph = graphs[0] | |||
| # # do random ordering: relabel nodes | |||
| # node_order = list(range(graph.number_of_nodes())) | |||
| # shuffle(node_order) | |||
| # order_mapping = dict(zip(graph.nodes(), node_order)) | |||
| # graph = nx.relabel_nodes(graph, order_mapping, copy=True) | |||
| # print(graph.nodes()) | |||
| @@ -0,0 +1,50 @@ | |||
| import numpy as np | |||
| import matplotlib as mpl | |||
| import matplotlib.pyplot as plt | |||
| import seaborn as sns | |||
| sns.set() | |||
| sns.set_style("ticks") | |||
| sns.set_context("poster",font_scale=1.28,rc={"lines.linewidth": 3}) | |||
| ### plot robustness result | |||
| noise = np.array([0,0.2,0.4,0.6,0.8,1.0]) | |||
| MLP_degree = np.array([0.3440, 0.1365, 0.0663, 0.0430, 0.0214, 0.0201]) | |||
| RNN_degree = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]) | |||
| BA_degree = np.array([0.0892,0.3558,1.1754,1.5914,1.7037,1.7502]) | |||
| Gnp_degree = np.array([1.7115,1.5536,0.5529,0.1433,0.0725,0.0503]) | |||
| MLP_clustering = np.array([0.0096, 0.0056, 0.0027, 0.0020, 0.0012, 0.0028]) | |||
| RNN_clustering = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5]) | |||
| BA_clustering = np.array([0.0255,0.0881,0.3433,0.4237,0.6041,0.7851]) | |||
| Gnp_clustering = np.array([0.7683,0.1849,0.1081,0.0146,0.0210,0.0329]) | |||
| plt.plot(noise,Gnp_degree) | |||
| plt.plot(noise,BA_degree) | |||
| plt.plot(noise, MLP_degree) | |||
| # plt.plot(noise, RNN_degree) | |||
| # plt.rc('text', usetex=True) | |||
| plt.legend(['E-R','B-A','GraphRNN']) | |||
| plt.xlabel('Noise level') | |||
| plt.ylabel('MMD degree') | |||
| plt.tight_layout() | |||
| plt.savefig('figures_paper/robustness_degree.png',dpi=300) | |||
| plt.close() | |||
| plt.plot(noise,Gnp_clustering) | |||
| plt.plot(noise,BA_clustering) | |||
| plt.plot(noise, MLP_clustering) | |||
| # plt.plot(noise, RNN_clustering) | |||
| plt.legend(['E-R','B-A','GraphRNN']) | |||
| plt.xlabel('Noise level') | |||
| plt.ylabel('MMD clustering') | |||
| plt.tight_layout() | |||
| plt.savefig('figures_paper/robustness_clustering.png',dpi=300) | |||
| plt.close() | |||
| @@ -0,0 +1,4 @@ | |||
| tensorboard-logger | |||
| tensorflow | |||
| networkx==1.11 | |||
| pyemd | |||
| @@ -0,0 +1,55 @@ | |||
| import torch | |||
| import numpy as np | |||
| import time | |||
| def compute_kernel(x,y): | |||
| x_size = x.size(0) | |||
| y_size = y.size(0) | |||
| dim = x.size(1) | |||
| x_tile = x.view(x_size,1,dim) | |||
| x_tile = x_tile.repeat(1,y_size,1) | |||
| y_tile = y.view(1,y_size,dim) | |||
| y_tile = y_tile.repeat(x_size,1,1) | |||
| return torch.exp(-torch.mean((x_tile-y_tile)**2,dim = 2)/float(dim)) | |||
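# compute_mmd returns the (biased) squared MMD estimate
#   MMD^2(X, Y) = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
# using the Gaussian (RBF) kernel defined above.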
| def compute_mmd(x,y): | |||
| x_kernel = compute_kernel(x,x) | |||
| # print(x_kernel) | |||
| y_kernel = compute_kernel(y,y) | |||
| # print(y_kernel) | |||
| xy_kernel = compute_kernel(x,y) | |||
| # print(xy_kernel) | |||
| return torch.mean(x_kernel)+torch.mean(y_kernel)-2*torch.mean(xy_kernel) | |||
| # start = time.time() | |||
| # x = torch.randn(4000,1).cuda() | |||
| # y = torch.randn(4000,1).cuda() | |||
| # print(compute_mmd(x,y)) | |||
| # end = time.time() | |||
| # print('GPU time:', end-start) | |||
| start = time.time() | |||
| torch.manual_seed(123) | |||
| batch = 1000 | |||
| x = torch.randn(batch,1) | |||
| y_baseline = torch.randn(batch,1) | |||
| y_pred = torch.zeros(batch,1) | |||
| print('MMD baseline', compute_mmd(x,y_baseline)) | |||
| print('MMD prediction', compute_mmd(x,y_pred)) | |||
| # | |||
| # print('before',x) | |||
| # print('MMD', compute_mmd(x,y)) | |||
| # x_idx = np.random.permutation(x.size(0)) | |||
| # x = x[x_idx,:] | |||
| # print('after permutation',x) | |||
| # print('MMD', compute_mmd(x,y)) | |||
| # | |||
| # | |||
| # end = time.time() | |||
| # print('CPU time:', end-start) | |||
| @@ -0,0 +1,760 @@ | |||
| import networkx as nx | |||
| import numpy as np | |||
| import torch | |||
| import torch.nn as nn | |||
| import torch.nn.init as init | |||
| from torch.autograd import Variable | |||
| import matplotlib.pyplot as plt | |||
| import torch.nn.functional as F | |||
| from torch import optim | |||
| from torch.optim.lr_scheduler import MultiStepLR | |||
| from sklearn.decomposition import PCA | |||
| import logging | |||
| from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence | |||
| from time import gmtime, strftime | |||
| from sklearn.metrics import roc_curve | |||
| from sklearn.metrics import roc_auc_score | |||
| from sklearn.metrics import average_precision_score | |||
| from random import shuffle | |||
| import pickle | |||
| from tensorboard_logger import configure, log_value | |||
| import scipy.misc | |||
| import time as tm | |||
| from utils import * | |||
| from model import * | |||
| from data import * | |||
| from args import Args | |||
| import create_graphs | |||
| def train_vae_epoch(epoch, args, rnn, output, data_loader, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output): | |||
| rnn.train() | |||
| output.train() | |||
| loss_sum = 0 | |||
| for batch_idx, data in enumerate(data_loader): | |||
| rnn.zero_grad() | |||
| output.zero_grad() | |||
| x_unsorted = data['x'].float() | |||
| y_unsorted = data['y'].float() | |||
| y_len_unsorted = data['len'] | |||
| y_len_max = max(y_len_unsorted) | |||
| x_unsorted = x_unsorted[:, 0:y_len_max, :] | |||
| y_unsorted = y_unsorted[:, 0:y_len_max, :] | |||
| # initialize lstm hidden state according to batch size | |||
| rnn.hidden = rnn.init_hidden(batch_size=x_unsorted.size(0)) | |||
| # sort input | |||
| y_len,sort_index = torch.sort(y_len_unsorted,0,descending=True) | |||
| y_len = y_len.numpy().tolist() | |||
| x = torch.index_select(x_unsorted,0,sort_index) | |||
| y = torch.index_select(y_unsorted,0,sort_index) | |||
| x = Variable(x).cuda() | |||
| y = Variable(y).cuda() | |||
| # if using ground truth to train | |||
| h = rnn(x, pack=True, input_len=y_len) | |||
| y_pred,z_mu,z_lsgms = output(h) | |||
| y_pred = F.sigmoid(y_pred) | |||
| # clean | |||
| y_pred = pack_padded_sequence(y_pred, y_len, batch_first=True) | |||
| y_pred = pad_packed_sequence(y_pred, batch_first=True)[0] | |||
| z_mu = pack_padded_sequence(z_mu, y_len, batch_first=True) | |||
| z_mu = pad_packed_sequence(z_mu, batch_first=True)[0] | |||
| z_lsgms = pack_padded_sequence(z_lsgms, y_len, batch_first=True) | |||
| z_lsgms = pad_packed_sequence(z_lsgms, batch_first=True)[0] | |||
| # use cross entropy loss | |||
| loss_bce = binary_cross_entropy_weight(y_pred, y) | |||
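# KL divergence of the approximate posterior N(z_mu, exp(z_lsgms)) from the standard normal
# prior, in closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)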
| loss_kl = -0.5 * torch.sum(1 + z_lsgms - z_mu.pow(2) - z_lsgms.exp()) | |||
| loss_kl /= y.size(0)*y.size(1)*sum(y_len) # normalize | |||
| loss = loss_bce + loss_kl | |||
| loss.backward() | |||
| # update deterministic and lstm | |||
| optimizer_output.step() | |||
| optimizer_rnn.step() | |||
| scheduler_output.step() | |||
| scheduler_rnn.step() | |||
| z_mu_mean = torch.mean(z_mu.data) | |||
| z_sgm_mean = torch.mean(z_lsgms.mul(0.5).exp_().data) | |||
| z_mu_min = torch.min(z_mu.data) | |||
| z_sgm_min = torch.min(z_lsgms.mul(0.5).exp_().data) | |||
| z_mu_max = torch.max(z_mu.data) | |||
| z_sgm_max = torch.max(z_lsgms.mul(0.5).exp_().data) | |||
| if epoch % args.epochs_log==0 and batch_idx==0: # only output first batch's statistics | |||
| print('Epoch: {}/{}, train bce loss: {:.6f}, train kl loss: {:.6f}, graph type: {}, num_layer: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss_bce.data[0], loss_kl.data[0], args.graph_type, args.num_layers, args.hidden_size_rnn)) | |||
| print('z_mu_mean', z_mu_mean, 'z_mu_min', z_mu_min, 'z_mu_max', z_mu_max, 'z_sgm_mean', z_sgm_mean, 'z_sgm_min', z_sgm_min, 'z_sgm_max', z_sgm_max) | |||
| # logging | |||
| log_value('bce_loss_'+args.fname, loss_bce.data[0], epoch*args.batch_ratio+batch_idx) | |||
| log_value('kl_loss_' +args.fname, loss_kl.data[0], epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_mu_mean_'+args.fname, z_mu_mean, epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_mu_min_'+args.fname, z_mu_min, epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_mu_max_'+args.fname, z_mu_max, epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_sgm_mean_'+args.fname, z_sgm_mean, epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_sgm_min_'+args.fname, z_sgm_min, epoch*args.batch_ratio + batch_idx) | |||
| log_value('z_sgm_max_'+args.fname, z_sgm_max, epoch*args.batch_ratio + batch_idx) | |||
| loss_sum += loss.data[0] | |||
| return loss_sum/(batch_idx+1) | |||
| def test_vae_epoch(epoch, args, rnn, output, test_batch_size=16, save_histogram=False, sample_time = 1): | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| rnn.eval() | |||
| output.eval() | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
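# Autoregressive generation: at each step the RNN consumes the adjacency vector sampled at the
# previous step and outputs the scores for the next node's connections.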
| for i in range(max_num_node): | |||
| h = rnn(x_step) | |||
| y_pred_step, _, _ = output(h) | |||
| y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| x_step = sample_sigmoid(y_pred_step, sample=True, sample_time=sample_time) | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_data = y_pred.data | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| G_pred_list = [] | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| # save prediction histograms, plot histogram over each time step | |||
| # if save_histogram: | |||
| # save_prediction_histogram(y_pred_data.cpu().numpy(), | |||
| # fname_pred=args.figure_prediction_save_path+args.fname_pred+str(epoch)+'.jpg', | |||
| # max_num_node=max_num_node) | |||
| return G_pred_list | |||
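# graph-completion version of the VAE sampler: sampling is supervised by the ground-truth rows y for the
# observed part of each test graph (sample_sigmoid_supervised), so generation is conditioned on the given
# prefix; used by train_graph_completion below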
| def test_vae_partial_epoch(epoch, args, rnn, output, data_loader, save_histogram=False,sample_time=1): | |||
| rnn.eval() | |||
| output.eval() | |||
| G_pred_list = [] | |||
| for batch_idx, data in enumerate(data_loader): | |||
| x = data['x'].float() | |||
| y = data['y'].float() | |||
| y_len = data['len'] | |||
| test_batch_size = x.size(0) | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
| for i in range(max_num_node): | |||
            print('generating node', i)
| h = rnn(x_step) | |||
| y_pred_step, _, _ = output(h) | |||
| y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| x_step = sample_sigmoid_supervised(y_pred_step, y[:,i:i+1,:].cuda(), current=i, y_len=y_len, sample_time=sample_time) | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_data = y_pred.data | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| return G_pred_list | |||
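# training step for the MLP-output variant: the node-level rnn encodes the BFS sequence and a single MLP
# head predicts each full adjacency row at once, trained with the weighted binary cross entropy loss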
| def train_mlp_epoch(epoch, args, rnn, output, data_loader, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output): | |||
| rnn.train() | |||
| output.train() | |||
| loss_sum = 0 | |||
| for batch_idx, data in enumerate(data_loader): | |||
| rnn.zero_grad() | |||
| output.zero_grad() | |||
| x_unsorted = data['x'].float() | |||
| y_unsorted = data['y'].float() | |||
| y_len_unsorted = data['len'] | |||
| y_len_max = max(y_len_unsorted) | |||
| x_unsorted = x_unsorted[:, 0:y_len_max, :] | |||
| y_unsorted = y_unsorted[:, 0:y_len_max, :] | |||
| # initialize lstm hidden state according to batch size | |||
| rnn.hidden = rnn.init_hidden(batch_size=x_unsorted.size(0)) | |||
| # sort input | |||
| y_len,sort_index = torch.sort(y_len_unsorted,0,descending=True) | |||
| y_len = y_len.numpy().tolist() | |||
| x = torch.index_select(x_unsorted,0,sort_index) | |||
| y = torch.index_select(y_unsorted,0,sort_index) | |||
| x = Variable(x).cuda() | |||
| y = Variable(y).cuda() | |||
| h = rnn(x, pack=True, input_len=y_len) | |||
| y_pred = output(h) | |||
| y_pred = F.sigmoid(y_pred) | |||
| # clean | |||
| y_pred = pack_padded_sequence(y_pred, y_len, batch_first=True) | |||
| y_pred = pad_packed_sequence(y_pred, batch_first=True)[0] | |||
| # use cross entropy loss | |||
| loss = binary_cross_entropy_weight(y_pred, y) | |||
| loss.backward() | |||
| # update deterministic and lstm | |||
| optimizer_output.step() | |||
| optimizer_rnn.step() | |||
| scheduler_output.step() | |||
| scheduler_rnn.step() | |||
| if epoch % args.epochs_log==0 and batch_idx==0: # only output first batch's statistics | |||
| print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, num_layer: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss.data[0], args.graph_type, args.num_layers, args.hidden_size_rnn)) | |||
| # logging | |||
| log_value('loss_'+args.fname, loss.data[0], epoch*args.batch_ratio+batch_idx) | |||
| loss_sum += loss.data[0] | |||
| return loss_sum/(batch_idx+1) | |||
| def test_mlp_epoch(epoch, args, rnn, output, test_batch_size=16, save_histogram=False,sample_time=1): | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| rnn.eval() | |||
| output.eval() | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
| for i in range(max_num_node): | |||
| h = rnn(x_step) | |||
| y_pred_step = output(h) | |||
| y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| x_step = sample_sigmoid(y_pred_step, sample=True, sample_time=sample_time) | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_data = y_pred.data | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| G_pred_list = [] | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| # # save prediction histograms, plot histogram over each time step | |||
| # if save_histogram: | |||
| # save_prediction_histogram(y_pred_data.cpu().numpy(), | |||
| # fname_pred=args.figure_prediction_save_path+args.fname_pred+str(epoch)+'.jpg', | |||
| # max_num_node=max_num_node) | |||
| return G_pred_list | |||
| def test_mlp_partial_epoch(epoch, args, rnn, output, data_loader, save_histogram=False,sample_time=1): | |||
| rnn.eval() | |||
| output.eval() | |||
| G_pred_list = [] | |||
| for batch_idx, data in enumerate(data_loader): | |||
| x = data['x'].float() | |||
| y = data['y'].float() | |||
| y_len = data['len'] | |||
| test_batch_size = x.size(0) | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
| for i in range(max_num_node): | |||
            print('generating node', i)
| h = rnn(x_step) | |||
| y_pred_step = output(h) | |||
| y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| x_step = sample_sigmoid_supervised(y_pred_step, y[:,i:i+1,:].cuda(), current=i, y_len=y_len, sample_time=sample_time) | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_data = y_pred.data | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| return G_pred_list | |||
| def test_mlp_partial_simple_epoch(epoch, args, rnn, output, data_loader, save_histogram=False,sample_time=1): | |||
| rnn.eval() | |||
| output.eval() | |||
| G_pred_list = [] | |||
| for batch_idx, data in enumerate(data_loader): | |||
| x = data['x'].float() | |||
| y = data['y'].float() | |||
| y_len = data['len'] | |||
| test_batch_size = x.size(0) | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
| for i in range(max_num_node): | |||
            print('generating node', i)
| h = rnn(x_step) | |||
| y_pred_step = output(h) | |||
| y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| x_step = sample_sigmoid_supervised_simple(y_pred_step, y[:,i:i+1,:].cuda(), current=i, y_len=y_len, sample_time=sample_time) | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_data = y_pred.data | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| return G_pred_list | |||
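# forward-only pass for the MLP variant (no backward step or optimizer update): accumulates the BCE of each
# row, weighted by its number of valid entries; used by train_nll below for likelihood evaluation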
| def train_mlp_forward_epoch(epoch, args, rnn, output, data_loader): | |||
| rnn.train() | |||
| output.train() | |||
| loss_sum = 0 | |||
| for batch_idx, data in enumerate(data_loader): | |||
| rnn.zero_grad() | |||
| output.zero_grad() | |||
| x_unsorted = data['x'].float() | |||
| y_unsorted = data['y'].float() | |||
| y_len_unsorted = data['len'] | |||
| y_len_max = max(y_len_unsorted) | |||
| x_unsorted = x_unsorted[:, 0:y_len_max, :] | |||
| y_unsorted = y_unsorted[:, 0:y_len_max, :] | |||
| # initialize lstm hidden state according to batch size | |||
| rnn.hidden = rnn.init_hidden(batch_size=x_unsorted.size(0)) | |||
| # sort input | |||
| y_len,sort_index = torch.sort(y_len_unsorted,0,descending=True) | |||
| y_len = y_len.numpy().tolist() | |||
| x = torch.index_select(x_unsorted,0,sort_index) | |||
| y = torch.index_select(y_unsorted,0,sort_index) | |||
| x = Variable(x).cuda() | |||
| y = Variable(y).cuda() | |||
| h = rnn(x, pack=True, input_len=y_len) | |||
| y_pred = output(h) | |||
| y_pred = F.sigmoid(y_pred) | |||
| # clean | |||
| y_pred = pack_padded_sequence(y_pred, y_len, batch_first=True) | |||
| y_pred = pad_packed_sequence(y_pred, batch_first=True)[0] | |||
| # use cross entropy loss | |||
| loss = 0 | |||
| for j in range(y.size(1)): | |||
| # print('y_pred',y_pred[0,j,:],'y',y[0,j,:]) | |||
| end_idx = min(j+1,y.size(2)) | |||
| loss += binary_cross_entropy_weight(y_pred[:,j,0:end_idx], y[:,j,0:end_idx])*end_idx | |||
| if epoch % args.epochs_log==0 and batch_idx==0: # only output first batch's statistics | |||
| print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, num_layer: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss.data[0], args.graph_type, args.num_layers, args.hidden_size_rnn)) | |||
| # logging | |||
| log_value('loss_'+args.fname, loss.data[0], epoch*args.batch_ratio+batch_idx) | |||
| loss_sum += loss.data[0] | |||
| return loss_sum/(batch_idx+1) | |||
| ## too complicated, deprecated | |||
| # def test_mlp_partial_bfs_epoch(epoch, args, rnn, output, data_loader, save_histogram=False,sample_time=1): | |||
| # rnn.eval() | |||
| # output.eval() | |||
| # G_pred_list = [] | |||
| # for batch_idx, data in enumerate(data_loader): | |||
| # x = data['x'].float() | |||
| # y = data['y'].float() | |||
| # y_len = data['len'] | |||
| # test_batch_size = x.size(0) | |||
| # rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| # # generate graphs | |||
| # max_num_node = int(args.max_num_node) | |||
| # y_pred = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # normalized prediction score | |||
| # y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| # x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
| # for i in range(max_num_node): | |||
| # # 1 back up hidden state | |||
| # hidden_prev = Variable(rnn.hidden.data).cuda() | |||
| # h = rnn(x_step) | |||
| # y_pred_step = output(h) | |||
| # y_pred[:, i:i + 1, :] = F.sigmoid(y_pred_step) | |||
| # x_step = sample_sigmoid_supervised(y_pred_step, y[:,i:i+1,:].cuda(), current=i, y_len=y_len, sample_time=sample_time) | |||
| # y_pred_long[:, i:i + 1, :] = x_step | |||
| # | |||
| # rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| # | |||
| # print('finish node', i) | |||
| # y_pred_data = y_pred.data | |||
| # y_pred_long_data = y_pred_long.data.long() | |||
| # | |||
| # # save graphs as pickle | |||
| # for i in range(test_batch_size): | |||
| # adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| # G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| # G_pred_list.append(G_pred) | |||
| # return G_pred_list | |||
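# training step for the full GraphRNN model (rnn output module): the node-level rnn runs over the BFS
# sequence with teacher forcing, and its per-step hidden states initialize the edge-level rnn, which is
# trained on one edge sequence per generated node (see the repacking below)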
| def train_rnn_epoch(epoch, args, rnn, output, data_loader, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output): | |||
| rnn.train() | |||
| output.train() | |||
| loss_sum = 0 | |||
| for batch_idx, data in enumerate(data_loader): | |||
| rnn.zero_grad() | |||
| output.zero_grad() | |||
| x_unsorted = data['x'].float() | |||
| y_unsorted = data['y'].float() | |||
| y_len_unsorted = data['len'] | |||
| y_len_max = max(y_len_unsorted) | |||
| x_unsorted = x_unsorted[:, 0:y_len_max, :] | |||
| y_unsorted = y_unsorted[:, 0:y_len_max, :] | |||
| # initialize lstm hidden state according to batch size | |||
| rnn.hidden = rnn.init_hidden(batch_size=x_unsorted.size(0)) | |||
| # output.hidden = output.init_hidden(batch_size=x_unsorted.size(0)*x_unsorted.size(1)) | |||
| # sort input | |||
| y_len,sort_index = torch.sort(y_len_unsorted,0,descending=True) | |||
| y_len = y_len.numpy().tolist() | |||
| x = torch.index_select(x_unsorted,0,sort_index) | |||
| y = torch.index_select(y_unsorted,0,sort_index) | |||
| # input, output for output rnn module | |||
        # pack_padded_sequence flattens the padded batch time-step first: step 1 of every graph, then step 2, etc. (b1_l1, b2_l1, ..., b1_l2, b2_l2, ...)
| y_reshape = pack_padded_sequence(y,y_len,batch_first=True).data | |||
        # reverse y_reshape so that the rows are ordered by decreasing edge-sequence length, then add a feature dimension
| idx = [i for i in range(y_reshape.size(0)-1, -1, -1)] | |||
| idx = torch.LongTensor(idx) | |||
| y_reshape = y_reshape.index_select(0, idx) | |||
| y_reshape = y_reshape.view(y_reshape.size(0),y_reshape.size(1),1) | |||
| output_x = torch.cat((torch.ones(y_reshape.size(0),1,1),y_reshape[:,0:-1,0:1]),dim=1) | |||
| output_y = y_reshape | |||
| # batch size for output module: sum(y_len) | |||
| output_y_len = [] | |||
| output_y_len_bin = np.bincount(np.array(y_len)) | |||
| for i in range(len(output_y_len_bin)-1,0,-1): | |||
            count_temp = np.sum(output_y_len_bin[i:]) # number of sequences in the batch whose length is >= i
| output_y_len.extend([min(i,y.size(2))]*count_temp) # put them in output_y_len; max value should not exceed y.size(2) | |||
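        # illustrative example: for y_len = [3, 2] the packed rows arrive as b1_l1, b2_l1, b1_l2, b2_l2, b1_l3;
        # after the reversal above they are ordered by decreasing edge-sequence length, and
        # output_y_len = [3, 2, 2, 1, 1] (each entry capped at y.size(2)), which is the layout
        # pack_padded_sequence expects for the edge-level rnn below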
| # pack into variable | |||
| x = Variable(x).cuda() | |||
| y = Variable(y).cuda() | |||
| output_x = Variable(output_x).cuda() | |||
| output_y = Variable(output_y).cuda() | |||
| # print(output_y_len) | |||
| # print('len',len(output_y_len)) | |||
| # print('y',y.size()) | |||
| # print('output_y',output_y.size()) | |||
| # if using ground truth to train | |||
| h = rnn(x, pack=True, input_len=y_len) | |||
| h = pack_padded_sequence(h,y_len,batch_first=True).data # get packed hidden vector | |||
| # reverse h | |||
| idx = [i for i in range(h.size(0) - 1, -1, -1)] | |||
| idx = Variable(torch.LongTensor(idx)).cuda() | |||
| h = h.index_select(0, idx) | |||
| hidden_null = Variable(torch.zeros(args.num_layers-1, h.size(0), h.size(1))).cuda() | |||
| output.hidden = torch.cat((h.view(1,h.size(0),h.size(1)),hidden_null),dim=0) # num_layers, batch_size, hidden_size | |||
| y_pred = output(output_x, pack=True, input_len=output_y_len) | |||
| y_pred = F.sigmoid(y_pred) | |||
| # clean | |||
| y_pred = pack_padded_sequence(y_pred, output_y_len, batch_first=True) | |||
| y_pred = pad_packed_sequence(y_pred, batch_first=True)[0] | |||
| output_y = pack_padded_sequence(output_y,output_y_len,batch_first=True) | |||
| output_y = pad_packed_sequence(output_y,batch_first=True)[0] | |||
| # use cross entropy loss | |||
| loss = binary_cross_entropy_weight(y_pred, output_y) | |||
| loss.backward() | |||
| # update deterministic and lstm | |||
| optimizer_output.step() | |||
| optimizer_rnn.step() | |||
| scheduler_output.step() | |||
| scheduler_rnn.step() | |||
| if epoch % args.epochs_log==0 and batch_idx==0: # only output first batch's statistics | |||
| print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, num_layer: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss.data[0], args.graph_type, args.num_layers, args.hidden_size_rnn)) | |||
| # logging | |||
| log_value('loss_'+args.fname, loss.data[0], epoch*args.batch_ratio+batch_idx) | |||
| feature_dim = y.size(1)*y.size(2) | |||
| loss_sum += loss.data[0]*feature_dim | |||
| return loss_sum/(batch_idx+1) | |||
| def test_rnn_epoch(epoch, args, rnn, output, test_batch_size=16): | |||
| rnn.hidden = rnn.init_hidden(test_batch_size) | |||
| rnn.eval() | |||
| output.eval() | |||
| # generate graphs | |||
| max_num_node = int(args.max_num_node) | |||
| y_pred_long = Variable(torch.zeros(test_batch_size, max_num_node, args.max_prev_node)).cuda() # discrete prediction | |||
| x_step = Variable(torch.ones(test_batch_size,1,args.max_prev_node)).cuda() | |||
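    # two-level autoregressive generation: for each node i, the node-level rnn's hidden state seeds the
    # edge-level rnn (output), and the inner loop samples node i's connections to its
    # min(i+1, max_prev_node) most recent predecessors one entry at a time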
| for i in range(max_num_node): | |||
| h = rnn(x_step) | |||
| # output.hidden = h.permute(1,0,2) | |||
| hidden_null = Variable(torch.zeros(args.num_layers - 1, h.size(0), h.size(2))).cuda() | |||
| output.hidden = torch.cat((h.permute(1,0,2), hidden_null), | |||
| dim=0) # num_layers, batch_size, hidden_size | |||
| x_step = Variable(torch.zeros(test_batch_size,1,args.max_prev_node)).cuda() | |||
| output_x_step = Variable(torch.ones(test_batch_size,1,1)).cuda() | |||
| for j in range(min(args.max_prev_node,i+1)): | |||
| output_y_pred_step = output(output_x_step) | |||
| output_x_step = sample_sigmoid(output_y_pred_step, sample=True, sample_time=1) | |||
| x_step[:,:,j:j+1] = output_x_step | |||
| output.hidden = Variable(output.hidden.data).cuda() | |||
| y_pred_long[:, i:i + 1, :] = x_step | |||
| rnn.hidden = Variable(rnn.hidden.data).cuda() | |||
| y_pred_long_data = y_pred_long.data.long() | |||
| # save graphs as pickle | |||
| G_pred_list = [] | |||
| for i in range(test_batch_size): | |||
| adj_pred = decode_adj(y_pred_long_data[i].cpu().numpy()) | |||
| G_pred = get_graph(adj_pred) # get a graph from zero-padded adj | |||
| G_pred_list.append(G_pred) | |||
| return G_pred_list | |||
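# same forward pass as train_rnn_epoch but without gradient updates; used by train_nll below to report
# train/test negative log-likelihood for the rnn output variant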
| def train_rnn_forward_epoch(epoch, args, rnn, output, data_loader): | |||
| rnn.train() | |||
| output.train() | |||
| loss_sum = 0 | |||
| for batch_idx, data in enumerate(data_loader): | |||
| rnn.zero_grad() | |||
| output.zero_grad() | |||
| x_unsorted = data['x'].float() | |||
| y_unsorted = data['y'].float() | |||
| y_len_unsorted = data['len'] | |||
| y_len_max = max(y_len_unsorted) | |||
| x_unsorted = x_unsorted[:, 0:y_len_max, :] | |||
| y_unsorted = y_unsorted[:, 0:y_len_max, :] | |||
| # initialize lstm hidden state according to batch size | |||
| rnn.hidden = rnn.init_hidden(batch_size=x_unsorted.size(0)) | |||
| # output.hidden = output.init_hidden(batch_size=x_unsorted.size(0)*x_unsorted.size(1)) | |||
| # sort input | |||
| y_len,sort_index = torch.sort(y_len_unsorted,0,descending=True) | |||
| y_len = y_len.numpy().tolist() | |||
| x = torch.index_select(x_unsorted,0,sort_index) | |||
| y = torch.index_select(y_unsorted,0,sort_index) | |||
| # input, output for output rnn module | |||
        # pack_padded_sequence flattens the padded batch time-step first: step 1 of every graph, then step 2, etc. (b1_l1, b2_l1, ..., b1_l2, b2_l2, ...)
| y_reshape = pack_padded_sequence(y,y_len,batch_first=True).data | |||
        # reverse y_reshape so that the rows are ordered by decreasing edge-sequence length, then add a feature dimension
| idx = [i for i in range(y_reshape.size(0)-1, -1, -1)] | |||
| idx = torch.LongTensor(idx) | |||
| y_reshape = y_reshape.index_select(0, idx) | |||
| y_reshape = y_reshape.view(y_reshape.size(0),y_reshape.size(1),1) | |||
| output_x = torch.cat((torch.ones(y_reshape.size(0),1,1),y_reshape[:,0:-1,0:1]),dim=1) | |||
| output_y = y_reshape | |||
| # batch size for output module: sum(y_len) | |||
| output_y_len = [] | |||
| output_y_len_bin = np.bincount(np.array(y_len)) | |||
| for i in range(len(output_y_len_bin)-1,0,-1): | |||
            count_temp = np.sum(output_y_len_bin[i:]) # number of sequences in the batch whose length is >= i
| output_y_len.extend([min(i,y.size(2))]*count_temp) # put them in output_y_len; max value should not exceed y.size(2) | |||
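        # output_x / output_y / output_y_len are built exactly as in train_rnn_epoch (see the worked example there)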
| # pack into variable | |||
| x = Variable(x).cuda() | |||
| y = Variable(y).cuda() | |||
| output_x = Variable(output_x).cuda() | |||
| output_y = Variable(output_y).cuda() | |||
| # print(output_y_len) | |||
| # print('len',len(output_y_len)) | |||
| # print('y',y.size()) | |||
| # print('output_y',output_y.size()) | |||
| # if using ground truth to train | |||
| h = rnn(x, pack=True, input_len=y_len) | |||
| h = pack_padded_sequence(h,y_len,batch_first=True).data # get packed hidden vector | |||
| # reverse h | |||
| idx = [i for i in range(h.size(0) - 1, -1, -1)] | |||
| idx = Variable(torch.LongTensor(idx)).cuda() | |||
| h = h.index_select(0, idx) | |||
| hidden_null = Variable(torch.zeros(args.num_layers-1, h.size(0), h.size(1))).cuda() | |||
| output.hidden = torch.cat((h.view(1,h.size(0),h.size(1)),hidden_null),dim=0) # num_layers, batch_size, hidden_size | |||
| y_pred = output(output_x, pack=True, input_len=output_y_len) | |||
| y_pred = F.sigmoid(y_pred) | |||
| # clean | |||
| y_pred = pack_padded_sequence(y_pred, output_y_len, batch_first=True) | |||
| y_pred = pad_packed_sequence(y_pred, batch_first=True)[0] | |||
| output_y = pack_padded_sequence(output_y,output_y_len,batch_first=True) | |||
| output_y = pad_packed_sequence(output_y,batch_first=True)[0] | |||
| # use cross entropy loss | |||
| loss = binary_cross_entropy_weight(y_pred, output_y) | |||
| if epoch % args.epochs_log==0 and batch_idx==0: # only output first batch's statistics | |||
| print('Epoch: {}/{}, train loss: {:.6f}, graph type: {}, num_layer: {}, hidden: {}'.format( | |||
| epoch, args.epochs,loss.data[0], args.graph_type, args.num_layers, args.hidden_size_rnn)) | |||
| # logging | |||
| log_value('loss_'+args.fname, loss.data[0], epoch*args.batch_ratio+batch_idx) | |||
| # print(y_pred.size()) | |||
| feature_dim = y_pred.size(0)*y_pred.size(1) | |||
| loss_sum += loss.data[0]*feature_dim/y.size(0) | |||
| return loss_sum/(batch_idx+1) | |||
########### main train loop: dispatches to the VAE / MLP / RNN variants based on args.note
| def train(args, dataset_train, rnn, output): | |||
| # check if load existing model | |||
| if args.load: | |||
| fname = args.model_save_path + args.fname + 'lstm_' + str(args.load_epoch) + '.dat' | |||
| rnn.load_state_dict(torch.load(fname)) | |||
| fname = args.model_save_path + args.fname + 'output_' + str(args.load_epoch) + '.dat' | |||
| output.load_state_dict(torch.load(fname)) | |||
| args.lr = 0.00001 | |||
| epoch = args.load_epoch | |||
        print('model loaded! lr: {}'.format(args.lr))
| else: | |||
| epoch = 1 | |||
| # initialize optimizer | |||
| optimizer_rnn = optim.Adam(list(rnn.parameters()), lr=args.lr) | |||
| optimizer_output = optim.Adam(list(output.parameters()), lr=args.lr) | |||
| scheduler_rnn = MultiStepLR(optimizer_rnn, milestones=args.milestones, gamma=args.lr_rate) | |||
| scheduler_output = MultiStepLR(optimizer_output, milestones=args.milestones, gamma=args.lr_rate) | |||
| # start main loop | |||
| time_all = np.zeros(args.epochs) | |||
| while epoch<=args.epochs: | |||
| time_start = tm.time() | |||
| # train | |||
| if 'GraphRNN_VAE' in args.note: | |||
| train_vae_epoch(epoch, args, rnn, output, dataset_train, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output) | |||
| elif 'GraphRNN_MLP' in args.note: | |||
| train_mlp_epoch(epoch, args, rnn, output, dataset_train, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output) | |||
| elif 'GraphRNN_RNN' in args.note: | |||
| train_rnn_epoch(epoch, args, rnn, output, dataset_train, | |||
| optimizer_rnn, optimizer_output, | |||
| scheduler_rnn, scheduler_output) | |||
| time_end = tm.time() | |||
| time_all[epoch - 1] = time_end - time_start | |||
| # test | |||
| if epoch % args.epochs_test == 0 and epoch>=args.epochs_test_start: | |||
| for sample_time in range(1,4): | |||
| G_pred = [] | |||
| while len(G_pred)<args.test_total_size: | |||
| if 'GraphRNN_VAE' in args.note: | |||
| G_pred_step = test_vae_epoch(epoch, args, rnn, output, test_batch_size=args.test_batch_size,sample_time=sample_time) | |||
| elif 'GraphRNN_MLP' in args.note: | |||
| G_pred_step = test_mlp_epoch(epoch, args, rnn, output, test_batch_size=args.test_batch_size,sample_time=sample_time) | |||
| elif 'GraphRNN_RNN' in args.note: | |||
| G_pred_step = test_rnn_epoch(epoch, args, rnn, output, test_batch_size=args.test_batch_size) | |||
| G_pred.extend(G_pred_step) | |||
| # save graphs | |||
| fname = args.graph_save_path + args.fname_pred + str(epoch) +'_'+str(sample_time) + '.dat' | |||
| save_graph_list(G_pred, fname) | |||
| if 'GraphRNN_RNN' in args.note: | |||
| break | |||
| print('test done, graphs saved') | |||
| # save model checkpoint | |||
| if args.save: | |||
| if epoch % args.epochs_save == 0: | |||
| fname = args.model_save_path + args.fname + 'lstm_' + str(epoch) + '.dat' | |||
| torch.save(rnn.state_dict(), fname) | |||
| fname = args.model_save_path + args.fname + 'output_' + str(epoch) + '.dat' | |||
| torch.save(output.state_dict(), fname) | |||
| epoch += 1 | |||
| np.save(args.timing_save_path+args.fname,time_all) | |||
| ########### for graph completion task | |||
| def train_graph_completion(args, dataset_test, rnn, output): | |||
| fname = args.model_save_path + args.fname + 'lstm_' + str(args.load_epoch) + '.dat' | |||
| rnn.load_state_dict(torch.load(fname)) | |||
| fname = args.model_save_path + args.fname + 'output_' + str(args.load_epoch) + '.dat' | |||
| output.load_state_dict(torch.load(fname)) | |||
| epoch = args.load_epoch | |||
    print('model loaded! epoch: {}'.format(args.load_epoch))
| for sample_time in range(1,4): | |||
| if 'GraphRNN_MLP' in args.note: | |||
| G_pred = test_mlp_partial_simple_epoch(epoch, args, rnn, output, dataset_test,sample_time=sample_time) | |||
| if 'GraphRNN_VAE' in args.note: | |||
| G_pred = test_vae_partial_epoch(epoch, args, rnn, output, dataset_test,sample_time=sample_time) | |||
| # save graphs | |||
| fname = args.graph_save_path + args.fname_pred + str(epoch) +'_'+str(sample_time) + 'graph_completion.dat' | |||
| save_graph_list(G_pred, fname) | |||
| print('graph completion done, graphs saved') | |||
| ########### for NLL evaluation | |||
| def train_nll(args, dataset_train, dataset_test, rnn, output,graph_validate_len,graph_test_len, max_iter = 1000): | |||
| fname = args.model_save_path + args.fname + 'lstm_' + str(args.load_epoch) + '.dat' | |||
| rnn.load_state_dict(torch.load(fname)) | |||
| fname = args.model_save_path + args.fname + 'output_' + str(args.load_epoch) + '.dat' | |||
| output.load_state_dict(torch.load(fname)) | |||
| epoch = args.load_epoch | |||
    print('model loaded! epoch: {}'.format(args.load_epoch))
| fname_output = args.nll_save_path + args.note + '_' + args.graph_type + '.csv' | |||
| with open(fname_output, 'w+') as f: | |||
| f.write(str(graph_validate_len)+','+str(graph_test_len)+'\n') | |||
| f.write('train,test\n') | |||
| for iter in range(max_iter): | |||
| if 'GraphRNN_MLP' in args.note: | |||
| nll_train = train_mlp_forward_epoch(epoch, args, rnn, output, dataset_train) | |||
| nll_test = train_mlp_forward_epoch(epoch, args, rnn, output, dataset_test) | |||
| if 'GraphRNN_RNN' in args.note: | |||
| nll_train = train_rnn_forward_epoch(epoch, args, rnn, output, dataset_train) | |||
| nll_test = train_rnn_forward_epoch(epoch, args, rnn, output, dataset_test) | |||
| print('train',nll_train,'test',nll_test) | |||
| f.write(str(nll_train)+','+str(nll_test)+'\n') | |||
| print('NLL evaluation done') | |||
| @@ -0,0 +1,518 @@ | |||
| import networkx as nx | |||
| import numpy as np | |||
| import torch | |||
| import torch.nn as nn | |||
| import torch.nn.init as init | |||
| from torch.autograd import Variable | |||
| import matplotlib.pyplot as plt | |||
| import torch.nn.functional as F | |||
| from torch import optim | |||
| from torch.optim.lr_scheduler import MultiStepLR | |||
| # import node2vec.src.main as nv | |||
| from sklearn.decomposition import PCA | |||
| import community | |||
| import pickle | |||
| import re | |||
| import data | |||
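# build the Citeseer ego-network dataset: take the largest connected component of Citeseer, then keep
# every 3-hop ego graph that has between 50 and 400 nodes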
| def citeseer_ego(): | |||
| _, _, G = data.Graph_load(dataset='citeseer') | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| G = nx.convert_node_labels_to_integers(G) | |||
| graphs = [] | |||
| for i in range(G.number_of_nodes()): | |||
| G_ego = nx.ego_graph(G, i, radius=3) | |||
| if G_ego.number_of_nodes() >= 50 and (G_ego.number_of_nodes() <= 400): | |||
| graphs.append(G_ego) | |||
| return graphs | |||
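# relaxed caveman-style generator (written for c=2 communities of size k): remove each within-community
# edge with probability 1 - p_edge, add about ceil(p_path * k) random inter-community edges, and return
# the largest connected component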
| def caveman_special(c=2,k=20,p_path=0.1,p_edge=0.3): | |||
| p = p_path | |||
| path_count = max(int(np.ceil(p * k)),1) | |||
| G = nx.caveman_graph(c, k) | |||
| # remove 50% edges | |||
| p = 1-p_edge | |||
| for (u, v) in list(G.edges()): | |||
| if np.random.rand() < p and ((u < k and v < k) or (u >= k and v >= k)): | |||
| G.remove_edge(u, v) | |||
| # add path_count links | |||
| for i in range(path_count): | |||
| u = np.random.randint(0, k) | |||
| v = np.random.randint(k, k * 2) | |||
| G.add_edge(u, v) | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| return G | |||
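# build a synthetic multi-community graph: one dense gnp_random_graph(size, 0.7) block per community,
# joined by sparse inter-community edges (each cross-community node pair is linked with probability
# p_inter, with at least one edge guaranteed between every pair of communities)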
| def n_community(c_sizes, p_inter=0.01): | |||
| graphs = [nx.gnp_random_graph(c_sizes[i], 0.7, seed=i) for i in range(len(c_sizes))] | |||
| G = nx.disjoint_union_all(graphs) | |||
| communities = list(nx.connected_component_subgraphs(G)) | |||
| for i in range(len(communities)): | |||
| subG1 = communities[i] | |||
| nodes1 = list(subG1.nodes()) | |||
| for j in range(i+1, len(communities)): | |||
| subG2 = communities[j] | |||
| nodes2 = list(subG2.nodes()) | |||
| has_inter_edge = False | |||
| for n1 in nodes1: | |||
| for n2 in nodes2: | |||
| if np.random.rand() < p_inter: | |||
| G.add_edge(n1, n2) | |||
| has_inter_edge = True | |||
| if not has_inter_edge: | |||
| G.add_edge(nodes1[0], nodes2[0]) | |||
| #print('connected comp: ', len(list(nx.connected_component_subgraphs(G)))) | |||
| return G | |||
| def perturb(graph_list, p_del, p_add=None): | |||
    ''' Perturb the list of graphs by adding/removing edges.
    Args:
        graph_list: list of networkx graphs to perturb
        p_del: probability of removing each existing edge
        p_add: probability of adding an edge between each node pair. If None, it is estimated
            from the graph density such that the expected number of added edges equals the
            expected number of deleted edges.
    Returns:
        A list of graphs that are perturbed from the original graphs
    '''
| perturbed_graph_list = [] | |||
| for G_original in graph_list: | |||
| G = G_original.copy() | |||
| trials = np.random.binomial(1, p_del, size=G.number_of_edges()) | |||
| edges = list(G.edges()) | |||
| i = 0 | |||
| for (u, v) in edges: | |||
| if trials[i] == 1: | |||
| G.remove_edge(u, v) | |||
| i += 1 | |||
| if p_add is None: | |||
| num_nodes = G.number_of_nodes() | |||
| p_add_est = np.sum(trials) / (num_nodes * (num_nodes - 1) / 2 - | |||
| G.number_of_edges()) | |||
| else: | |||
| p_add_est = p_add | |||
| nodes = list(G.nodes()) | |||
        for i in range(len(nodes)):
            u = nodes[i]
            # one add-trial per node; only pairs (i, j) with j > i are considered below
            trials = np.random.binomial(1, p_add_est, size=G.number_of_nodes())
            for j in range(i + 1, len(nodes)):
                v = nodes[j]
                if trials[j] == 1:
                    G.add_edge(u, v)
| perturbed_graph_list.append(G) | |||
| return perturbed_graph_list | |||
| def perturb_new(graph_list, p): | |||
    ''' Perturb the list of graphs by rewiring edges.
    Each existing edge is removed with probability p, and the same number of edges is then
    added back between uniformly sampled, previously non-adjacent node pairs.
    Args:
        graph_list: list of networkx graphs to perturb
        p: probability of removing each existing edge
    Returns:
        A list of graphs that are perturbed from the original graphs
    '''
| perturbed_graph_list = [] | |||
| for G_original in graph_list: | |||
| G = G_original.copy() | |||
| edge_remove_count = 0 | |||
| for (u, v) in list(G.edges()): | |||
| if np.random.rand()<p: | |||
| G.remove_edge(u, v) | |||
| edge_remove_count += 1 | |||
| # randomly add the edges back | |||
| for i in range(edge_remove_count): | |||
| while True: | |||
| u = np.random.randint(0, G.number_of_nodes()) | |||
| v = np.random.randint(0, G.number_of_nodes()) | |||
| if (not G.has_edge(u,v)) and (u!=v): | |||
| break | |||
| G.add_edge(u, v) | |||
| perturbed_graph_list.append(G) | |||
| return perturbed_graph_list | |||
| def imsave(fname, arr, vmin=None, vmax=None, cmap=None, format=None, origin=None): | |||
| from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas | |||
| from matplotlib.figure import Figure | |||
| fig = Figure(figsize=arr.shape[::-1], dpi=1, frameon=False) | |||
| canvas = FigureCanvas(fig) | |||
| fig.figimage(arr, cmap=cmap, vmin=vmin, vmax=vmax, origin=origin) | |||
| fig.savefig(fname, dpi=1, format=format) | |||
| def save_prediction_histogram(y_pred_data, fname_pred, max_num_node, bin_n=20): | |||
| bin_edge = np.linspace(1e-6, 1, bin_n + 1) | |||
| output_pred = np.zeros((bin_n, max_num_node)) | |||
| for i in range(max_num_node): | |||
| output_pred[:, i], _ = np.histogram(y_pred_data[:, i, :], bins=bin_edge, density=False) | |||
| # normalize | |||
| output_pred[:, i] /= np.sum(output_pred[:, i]) | |||
| imsave(fname=fname_pred, arr=output_pred, origin='upper', cmap='Greys_r', vmin=0.0, vmax=3.0 / bin_n) | |||
| # draw a single graph G | |||
| def draw_graph(G, prefix = 'test'): | |||
| parts = community.best_partition(G) | |||
| values = [parts.get(node) for node in G.nodes()] | |||
    # map the first seven communities to fixed colors; any further community falls back to grey,
    # so the color list always has one entry per node
    color_map = ['red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black']
    colors = [color_map[v] if v < len(color_map) else 'grey' for v in values]
| # spring_pos = nx.spring_layout(G) | |||
| plt.switch_backend('agg') | |||
| plt.axis("off") | |||
| pos = nx.spring_layout(G) | |||
| nx.draw_networkx(G, with_labels=True, node_size=35, node_color=colors,pos=pos) | |||
| # plt.switch_backend('agg') | |||
| # options = { | |||
| # 'node_color': 'black', | |||
| # 'node_size': 10, | |||
| # 'width': 1 | |||
| # } | |||
| # plt.figure() | |||
| # plt.subplot() | |||
| # nx.draw_networkx(G, **options) | |||
| plt.savefig('figures/graph_view_'+prefix+'.png', dpi=200) | |||
| plt.close() | |||
| plt.switch_backend('agg') | |||
| G_deg = nx.degree_histogram(G) | |||
| G_deg = np.array(G_deg) | |||
| # plt.plot(range(len(G_deg)), G_deg, 'r', linewidth = 2) | |||
| plt.loglog(np.arange(len(G_deg))[G_deg>0], G_deg[G_deg>0], 'r', linewidth=2) | |||
| plt.savefig('figures/degree_view_' + prefix + '.png', dpi=200) | |||
| plt.close() | |||
| # degree_sequence = sorted(nx.degree(G).values(), reverse=True) # degree sequence | |||
| # plt.loglog(degree_sequence, 'b-', marker='o') | |||
| # plt.title("Degree rank plot") | |||
| # plt.ylabel("degree") | |||
| # plt.xlabel("rank") | |||
| # plt.savefig('figures/degree_view_' + prefix + '.png', dpi=200) | |||
| # plt.close() | |||
| # G = nx.grid_2d_graph(8,8) | |||
| # G = nx.karate_club_graph() | |||
| # draw_graph(G) | |||
| # draw a list of graphs [G] | |||
| def draw_graph_list(G_list, row, col, fname = 'figures/test', layout='spring', is_single=False,k=1,node_size=55,alpha=1,width=1.3): | |||
| # # draw graph view | |||
| # from pylab import rcParams | |||
| # rcParams['figure.figsize'] = 12,3 | |||
| plt.switch_backend('agg') | |||
| for i,G in enumerate(G_list): | |||
| plt.subplot(row,col,i+1) | |||
| plt.subplots_adjust(left=0, bottom=0, right=1, top=1, | |||
| wspace=0, hspace=0) | |||
| # if i%2==0: | |||
| # plt.title('real nodes: '+str(G.number_of_nodes()), fontsize = 4) | |||
| # else: | |||
| # plt.title('pred nodes: '+str(G.number_of_nodes()), fontsize = 4) | |||
| # plt.title('num of nodes: '+str(G.number_of_nodes()), fontsize = 4) | |||
| # parts = community.best_partition(G) | |||
| # values = [parts.get(node) for node in G.nodes()] | |||
| # colors = [] | |||
| # for i in range(len(values)): | |||
| # if values[i] == 0: | |||
| # colors.append('red') | |||
| # if values[i] == 1: | |||
| # colors.append('green') | |||
| # if values[i] == 2: | |||
| # colors.append('blue') | |||
| # if values[i] == 3: | |||
| # colors.append('yellow') | |||
| # if values[i] == 4: | |||
| # colors.append('orange') | |||
| # if values[i] == 5: | |||
| # colors.append('pink') | |||
| # if values[i] == 6: | |||
| # colors.append('black') | |||
| plt.axis("off") | |||
| if layout=='spring': | |||
| pos = nx.spring_layout(G,k=k/np.sqrt(G.number_of_nodes()),iterations=100) | |||
| # pos = nx.spring_layout(G) | |||
| elif layout=='spectral': | |||
| pos = nx.spectral_layout(G) | |||
| # # nx.draw_networkx(G, with_labels=True, node_size=2, width=0.15, font_size = 1.5, node_color=colors,pos=pos) | |||
| # nx.draw_networkx(G, with_labels=False, node_size=1.5, width=0.2, font_size = 1.5, linewidths=0.2, node_color = 'k',pos=pos,alpha=0.2) | |||
| if is_single: | |||
| # node_size default 60, edge_width default 1.5 | |||
| nx.draw_networkx_nodes(G, pos, node_size=node_size, node_color='#336699', alpha=1, linewidths=0, font_size=0) | |||
| nx.draw_networkx_edges(G, pos, alpha=alpha, width=width) | |||
| else: | |||
| nx.draw_networkx_nodes(G, pos, node_size=1.5, node_color='#336699',alpha=1, linewidths=0.2, font_size = 1.5) | |||
| nx.draw_networkx_edges(G, pos, alpha=0.3,width=0.2) | |||
| # plt.axis('off') | |||
| # plt.title('Complete Graph of Odd-degree Nodes') | |||
| # plt.show() | |||
| plt.tight_layout() | |||
| plt.savefig(fname+'.png', dpi=600) | |||
| plt.close() | |||
| # # draw degree distribution | |||
| # plt.switch_backend('agg') | |||
| # for i, G in enumerate(G_list): | |||
| # plt.subplot(row, col, i + 1) | |||
| # G_deg = np.array(list(G.degree(G.nodes()).values())) | |||
| # bins = np.arange(20) | |||
| # plt.hist(np.array(G_deg), bins=bins, align='left') | |||
| # plt.xlabel('degree', fontsize = 3) | |||
| # plt.ylabel('count', fontsize = 3) | |||
| # G_deg_mean = 2*G.number_of_edges()/float(G.number_of_nodes()) | |||
| # # if i % 2 == 0: | |||
| # # plt.title('real average degree: {:.2f}'.format(G_deg_mean), fontsize=4) | |||
| # # else: | |||
| # # plt.title('pred average degree: {:.2f}'.format(G_deg_mean), fontsize=4) | |||
| # plt.title('average degree: {:.2f}'.format(G_deg_mean), fontsize=4) | |||
| # plt.tick_params(axis='both', which='major', labelsize=3) | |||
| # plt.tick_params(axis='both', which='minor', labelsize=3) | |||
| # plt.tight_layout() | |||
| # plt.savefig(fname+'_degree.png', dpi=600) | |||
| # plt.close() | |||
| # | |||
| # # draw clustering distribution | |||
| # plt.switch_backend('agg') | |||
| # for i, G in enumerate(G_list): | |||
| # plt.subplot(row, col, i + 1) | |||
| # G_cluster = list(nx.clustering(G).values()) | |||
| # bins = np.linspace(0,1,20) | |||
| # plt.hist(np.array(G_cluster), bins=bins, align='left') | |||
| # plt.xlabel('clustering coefficient', fontsize=3) | |||
| # plt.ylabel('count', fontsize=3) | |||
| # G_cluster_mean = sum(G_cluster) / len(G_cluster) | |||
| # # if i % 2 == 0: | |||
| # # plt.title('real average clustering: {:.4f}'.format(G_cluster_mean), fontsize=4) | |||
| # # else: | |||
| # # plt.title('pred average clustering: {:.4f}'.format(G_cluster_mean), fontsize=4) | |||
| # plt.title('average clustering: {:.4f}'.format(G_cluster_mean), fontsize=4) | |||
| # plt.tick_params(axis='both', which='major', labelsize=3) | |||
| # plt.tick_params(axis='both', which='minor', labelsize=3) | |||
| # plt.tight_layout() | |||
| # plt.savefig(fname+'_clustering.png', dpi=600) | |||
| # plt.close() | |||
| # | |||
| # # draw circle distribution | |||
| # plt.switch_backend('agg') | |||
| # for i, G in enumerate(G_list): | |||
| # plt.subplot(row, col, i + 1) | |||
| # cycle_len = [] | |||
| # cycle_all = nx.cycle_basis(G) | |||
| # for item in cycle_all: | |||
| # cycle_len.append(len(item)) | |||
| # | |||
| # bins = np.arange(20) | |||
| # plt.hist(np.array(cycle_len), bins=bins, align='left') | |||
| # plt.xlabel('cycle length', fontsize=3) | |||
| # plt.ylabel('count', fontsize=3) | |||
| # G_cycle_mean = 0 | |||
| # if len(cycle_len)>0: | |||
| # G_cycle_mean = sum(cycle_len) / len(cycle_len) | |||
| # # if i % 2 == 0: | |||
| # # plt.title('real average cycle: {:.4f}'.format(G_cycle_mean), fontsize=4) | |||
| # # else: | |||
| # # plt.title('pred average cycle: {:.4f}'.format(G_cycle_mean), fontsize=4) | |||
| # plt.title('average cycle: {:.4f}'.format(G_cycle_mean), fontsize=4) | |||
| # plt.tick_params(axis='both', which='major', labelsize=3) | |||
| # plt.tick_params(axis='both', which='minor', labelsize=3) | |||
| # plt.tight_layout() | |||
| # plt.savefig(fname+'_cycle.png', dpi=600) | |||
| # plt.close() | |||
| # | |||
| # # draw community distribution | |||
| # plt.switch_backend('agg') | |||
| # for i, G in enumerate(G_list): | |||
| # plt.subplot(row, col, i + 1) | |||
| # parts = community.best_partition(G) | |||
| # values = np.array([parts.get(node) for node in G.nodes()]) | |||
| # counts = np.sort(np.bincount(values)[::-1]) | |||
| # pos = np.arange(len(counts)) | |||
| # plt.bar(pos,counts,align = 'edge') | |||
| # plt.xlabel('community ID', fontsize=3) | |||
| # plt.ylabel('count', fontsize=3) | |||
| # G_community_count = len(counts) | |||
| # # if i % 2 == 0: | |||
| # # plt.title('real average clustering: {}'.format(G_community_count), fontsize=4) | |||
| # # else: | |||
| # # plt.title('pred average clustering: {}'.format(G_community_count), fontsize=4) | |||
| # plt.title('average clustering: {}'.format(G_community_count), fontsize=4) | |||
| # plt.tick_params(axis='both', which='major', labelsize=3) | |||
| # plt.tick_params(axis='both', which='minor', labelsize=3) | |||
| # plt.tight_layout() | |||
| # plt.savefig(fname+'_community.png', dpi=600) | |||
| # plt.close() | |||
| # plt.switch_backend('agg') | |||
| # G_deg = nx.degree_histogram(G) | |||
| # G_deg = np.array(G_deg) | |||
| # # plt.plot(range(len(G_deg)), G_deg, 'r', linewidth = 2) | |||
| # plt.loglog(np.arange(len(G_deg))[G_deg>0], G_deg[G_deg>0], 'r', linewidth=2) | |||
| # plt.savefig('figures/degree_view_' + prefix + '.png', dpi=200) | |||
| # plt.close() | |||
| # degree_sequence = sorted(nx.degree(G).values(), reverse=True) # degree sequence | |||
| # plt.loglog(degree_sequence, 'b-', marker='o') | |||
| # plt.title("Degree rank plot") | |||
| # plt.ylabel("degree") | |||
| # plt.xlabel("rank") | |||
| # plt.savefig('figures/degree_view_' + prefix + '.png', dpi=200) | |||
| # plt.close() | |||
# directly get graph statistics from adj (obsolete)
| def decode_graph(adj, prefix): | |||
| adj = np.asmatrix(adj) | |||
| G = nx.from_numpy_matrix(adj) | |||
| # G.remove_nodes_from(nx.isolates(G)) | |||
| print('num of nodes: {}'.format(G.number_of_nodes())) | |||
| print('num of edges: {}'.format(G.number_of_edges())) | |||
| G_deg = nx.degree_histogram(G) | |||
| G_deg_sum = [a * b for a, b in zip(G_deg, range(0, len(G_deg)))] | |||
| print('average degree: {}'.format(sum(G_deg_sum) / G.number_of_nodes())) | |||
| if nx.is_connected(G): | |||
| print('average path length: {}'.format(nx.average_shortest_path_length(G))) | |||
| print('average diameter: {}'.format(nx.diameter(G))) | |||
| G_cluster = sorted(list(nx.clustering(G).values())) | |||
| print('average clustering coefficient: {}'.format(sum(G_cluster) / len(G_cluster))) | |||
| cycle_len = [] | |||
| cycle_all = nx.cycle_basis(G, 0) | |||
| for item in cycle_all: | |||
| cycle_len.append(len(item)) | |||
| print('cycles', cycle_len) | |||
| print('cycle count', len(cycle_len)) | |||
| draw_graph(G, prefix=prefix) | |||
| def get_graph(adj): | |||
| ''' | |||
| get a graph from zero-padded adj | |||
| :param adj: | |||
| :return: | |||
| ''' | |||
| # remove all zeros rows and columns | |||
| adj = adj[~np.all(adj == 0, axis=1)] | |||
| adj = adj[:, ~np.all(adj == 0, axis=0)] | |||
| adj = np.asmatrix(adj) | |||
| G = nx.from_numpy_matrix(adj) | |||
| return G | |||
| # save a list of graphs | |||
| def save_graph_list(G_list, fname): | |||
| with open(fname, "wb") as f: | |||
| pickle.dump(G_list, f) | |||
| # pick the first connected component | |||
| def pick_connected_component(G): | |||
| node_list = nx.node_connected_component(G,0) | |||
| return G.subgraph(node_list) | |||
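# heuristic cleanup for generated graphs: under the BFS node ordering every node except the first should
# attach to at least one earlier node, so the graph is truncated at the first node id >= 1 whose neighbors
# all have larger ids, and the largest connected component of that prefix is returned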
| def pick_connected_component_new(G): | |||
| adj_list = G.adjacency_list() | |||
| for id,adj in enumerate(adj_list): | |||
| id_min = min(adj) | |||
| if id<id_min and id>=1: | |||
| # if id<id_min and id>=4: | |||
| break | |||
    node_list = list(range(id)) # only keep nodes that come before node "id"
| G = G.subgraph(node_list) | |||
| G = max(nx.connected_component_subgraphs(G), key=len) | |||
| return G | |||
| # load a list of graphs | |||
| def load_graph_list(fname,is_real=True): | |||
| with open(fname, "rb") as f: | |||
| graph_list = pickle.load(f) | |||
| for i in range(len(graph_list)): | |||
| edges_with_selfloops = graph_list[i].selfloop_edges() | |||
| if len(edges_with_selfloops)>0: | |||
| graph_list[i].remove_edges_from(edges_with_selfloops) | |||
| if is_real: | |||
| graph_list[i] = max(nx.connected_component_subgraphs(graph_list[i]), key=len) | |||
| graph_list[i] = nx.convert_node_labels_to_integers(graph_list[i]) | |||
| else: | |||
| graph_list[i] = pick_connected_component_new(graph_list[i]) | |||
| return graph_list | |||
| def export_graphs_to_txt(g_list, output_filename_prefix): | |||
| i = 0 | |||
| for G in g_list: | |||
| f = open(output_filename_prefix + '_' + str(i) + '.txt', 'w+') | |||
| for (u, v) in G.edges(): | |||
| idx_u = G.nodes().index(u) | |||
| idx_v = G.nodes().index(v) | |||
| f.write(str(idx_u) + '\t' + str(idx_v) + '\n') | |||
| i += 1 | |||
| def snap_txt_output_to_nx(in_fname): | |||
| G = nx.Graph() | |||
| with open(in_fname, 'r') as f: | |||
| for line in f: | |||
| if not line[0] == '#': | |||
| splitted = re.split('[ \t]', line) | |||
| # self loop might be generated, but should be removed | |||
| u = int(splitted[0]) | |||
| v = int(splitted[1]) | |||
| if not u == v: | |||
| G.add_edge(int(u), int(v)) | |||
| return G | |||
| def test_perturbed(): | |||
| graphs = [] | |||
| for i in range(100,101): | |||
| for j in range(4,5): | |||
| for k in range(500): | |||
| graphs.append(nx.barabasi_albert_graph(i,j)) | |||
| g_perturbed = perturb(graphs, 0.9) | |||
| print([g.number_of_edges() for g in graphs]) | |||
| print([g.number_of_edges() for g in g_perturbed]) | |||
| if __name__ == '__main__': | |||
| #test_perturbed() | |||
| #graphs = load_graph_list('graphs/' + 'GraphRNN_RNN_community4_4_128_train_0.dat') | |||
| #graphs = load_graph_list('graphs/' + 'GraphRNN_RNN_community4_4_128_pred_2500_1.dat') | |||
| graphs = load_graph_list('eval_results/mmsb/' + 'community41.dat') | |||
| for i in range(0, 160, 16): | |||
| draw_graph_list(graphs[i:i+16], 4, 4, fname='figures/community4_' + str(i)) | |||