Multi-Scale Distillation from Multiple Graph Neural Networks

Authors: Chunhai Zhang, Jie Liu, Kai Dang, Wenzheng Zhang (pp. 4337-4344)

AAAI 2022 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to evaluate the proposed method on four public datasets. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods.
Researcher Affiliation | Collaboration | (1) College of Artificial Intelligence, Nankai University, Tianjin, China; (2) Cloopen AI Research, Beijing, China
Pseudocode | Yes | Algorithm 1: Multi-scale Knowledge Distillation. (See the distillation-objective sketch after this table.)
Open Source Code | Yes | Our code is publicly available at https://github.com/NKU-IIPLab/MSKD.
Open Datasets | Yes | We conduct a series of node classification tasks on four different datasets, i.e., PPI (Zitnik and Leskovec 2017), Cora, CiteSeer, and PubMed (Sen et al. 2008). (See the data-loading sketch after this table.)
Dataset Splits | Yes | PPI contains 24 graphs that come from different human tissues and 121 categories; 20 graphs are used for training, 2 for validation, and the remaining 2 for testing.
Hardware Specification | No | The paper does not explicitly state the hardware used for the experiments (e.g., GPU model, CPU type, memory).
Software Dependencies | No | The paper mentions software components such as GAT, GCN, and the Adam optimizer but does not provide version numbers for these or for libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | In the teacher GAT, each hidden layer has 4 attention heads and 256 hidden features, and the output layer has 6 attention heads and K hidden features. The student GAT has 5 layers; each hidden layer has 2 attention heads and 68 hidden features, and the output layer has 2 attention heads and K hidden features. The per-layer hidden-feature settings are the same in GCN. In all methods, the optimizer is Adam, the learning rate is 0.005, training runs for 500 epochs, and weight decay is 0. All other hyperparameters are tuned to the best results on the validation set. λ in Equation (8) is set to 7, 3, 3, and 4 on the four datasets, respectively. (See the configuration sketch after this table.)
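
The official code at https://github.com/NKU-IIPLab/MSKD was not inspected for this summary, so the library choice below is an assumption. The sketch shows one common way to obtain the four datasets and the reported 20/2/2 PPI graph split using DGL's built-in loaders.

```python
from dgl.data import (CiteseerGraphDataset, CoraGraphDataset,
                      PPIDataset, PubmedGraphDataset)

# PPI ships with the standard graph-level split quoted above:
# 20 training graphs, 2 validation graphs, 2 test graphs.
ppi_train = PPIDataset(mode="train")  # 20 graphs, 121 labels per node (multi-label)
ppi_valid = PPIDataset(mode="valid")  # 2 graphs
ppi_test = PPIDataset(mode="test")    # 2 graphs

# The citation datasets are single graphs; their splits come as boolean node masks.
cora = CoraGraphDataset()[0]
train_mask = cora.ndata["train_mask"]
val_mask = cora.ndata["val_mask"]
test_mask = cora.ndata["test_mask"]
```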
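
For the "Experiment Setup" row, a minimal sketch of a GAT and optimizer matching the reported widths, heads, learning rate, and weight decay, assuming DGL's GATConv. The teacher depth, the ELU activation, and the example Cora sizes (1433 input features, K = 7 classes) are assumptions not stated in the quoted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GATConv


class GAT(nn.Module):
    """Stack of GATConv layers; heads are concatenated in hidden layers and averaged at the output."""

    def __init__(self, in_feats, hidden_feats, num_classes, hidden_heads, out_heads, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(GATConv(in_feats, hidden_feats, hidden_heads, activation=F.elu))
        for _ in range(num_layers - 2):
            self.layers.append(
                GATConv(hidden_feats * hidden_heads, hidden_feats, hidden_heads, activation=F.elu)
            )
        self.layers.append(GATConv(hidden_feats * hidden_heads, num_classes, out_heads))

    def forward(self, g, h):
        for layer in self.layers[:-1]:
            h = layer(g, h).flatten(1)        # concatenate attention heads
        return self.layers[-1](g, h).mean(1)  # average heads in the output layer


in_feats, K = 1433, 7  # e.g., Cora; dataset-dependent
teacher = GAT(in_feats, 256, K, hidden_heads=4, out_heads=6, num_layers=3)  # teacher depth assumed
student = GAT(in_feats, 68, K, hidden_heads=2, out_heads=2, num_layers=5)   # 5 layers as reported

# Reported training settings: Adam, lr = 0.005, 500 epochs, weight decay = 0.
optimizer = torch.optim.Adam(student.parameters(), lr=0.005, weight_decay=0.0)
```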
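
The paper's Algorithm 1 and its multi-scale loss are not reproduced in the quoted text, so the sketch below is a generic teacher-student distillation objective, not the authors' method; it only illustrates where a weight like the λ of Equation (8) (reported as 7, 3, 3, and 4 on the four datasets) enters. The temperature and the single-label cross-entropy term (appropriate for the citation datasets; PPI is multi-label and would use a BCE term instead) are assumptions.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, lam=7.0, temperature=2.0):
    """Generic KD objective: supervised term plus a soft-label matching term weighted by lam."""
    # Supervised loss on the ground-truth labels (single-label case).
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label loss: the student matches the teacher's tempered class distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce + lam * kd
```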