Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs

Authors: Lirong Wu, Haitao Lin, Yufei Huang, Stan Z. Li

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments show that KRD improves over the vanilla MLPs by 12.62% and outperforms its corresponding teacher GNNs by 2.16%, averaged over 7 datasets and 3 GNN architectures. |
| Researcher Affiliation | Academia | AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China. |
| Pseudocode | Yes | Algorithm 1: Algorithm for KRD framework (Transductive). (A loose sketch of the core step follows the table.) |
| Open Source Code | Yes | Codes are publicly available at: https://github.com/LirongWu/RKD. |
| Open Datasets | Yes | The effectiveness of the KRD framework is evaluated on seven real-world datasets, including Cora (Sen et al., 2008), Citeseer (Giles et al., 1998), Pubmed (McCallum et al., 2000), Coauthor-CS, Coauthor-Physics, Amazon Photo (Shchur et al., 2018), and ogbn-arxiv (Hu et al., 2020). |
| Dataset Splits | Yes | Concretely, the input and output of the two settings are: (1) Transductive: training on X and Y_L and testing on (X_U, Y_U). (2) Inductive: training on X_L ∪ X_U^obs and Y_L and testing on (X_U^ind, Y_U^ind)... For a fairer comparison, the model with the highest validation accuracy is selected for testing. (A split sketch follows the table.) |
| Hardware Specification | Yes | Implemented based on the standard implementation in the DGL library (Wang et al., 2019) using PyTorch 1.6.0 with an Intel(R) Xeon(R) Gold 6240R @ 2.40GHz CPU and an NVIDIA V100 GPU. |
| Software Dependencies | Yes | Implemented based on the standard implementation in the DGL library (Wang et al., 2019) using PyTorch 1.6.0 with an Intel(R) Xeon(R) Gold 6240R @ 2.40GHz CPU and an NVIDIA V100 GPU. |
| Experiment Setup | Yes | The following hyperparameters are set the same for all datasets: epochs E = 500, noise variance δ = 1.0, and momentum rate η = 0.99 (0.9 for ogbn-arxiv). The other dataset-specific hyperparameters are determined by the AutoML toolkit NNI with the following search spaces: hidden dimension F = {128, 256, 512, 1024, 2048}, layer number L = {2, 3}, distillation temperature τ = {0.8, 0.9, 1.0, 1.1, 1.2}, loss weight α = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}, learning rate lr = {0.001, 0.005, 0.01}, and weight decay = {0.0, 0.0005, 0.001}. (Loss and search-space sketches follow the table.) |
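Algorithm 1 itself is not reproduced in this card, but the quoted hyperparameters hint at its core step: KRD quantifies the reliability of each node's knowledge by how invariant the teacher's predictive entropy is under Gaussian feature noise of variance δ, and a momentum rate η is quoted, presumably for smoothing these scores across training. The sketch below is a loose, assumed rendering of that scoring step, not the authors' algorithm; `teacher` is assumed to be a callable from node features to logits (e.g., a GNN with the graph closed over).

```python
import torch

@torch.no_grad()
def knowledge_reliability(teacher, x, delta=1.0, n_samples=5, eps=1e-12):
    """Assumed sketch: score each node by how little the teacher's predictive
    entropy drifts under Gaussian feature noise of variance delta.
    This paraphrases KRD's idea; it is not the paper's exact formula."""
    def entropy(logits):
        p = torch.softmax(logits, dim=-1)
        return -(p * (p + eps).log()).sum(dim=-1)

    base = entropy(teacher(x))                 # per-node entropy on clean features
    drift = torch.zeros_like(base)
    for _ in range(n_samples):
        noisy = x + delta ** 0.5 * torch.randn_like(x)   # std = sqrt(variance)
        drift += (entropy(teacher(noisy)) - base).abs()
    drift /= n_samples
    return 1.0 / (1.0 + drift)                 # more noise-invariant => more reliable
```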
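The Dataset Splits row distinguishes transductive from inductive evaluation. As a concrete illustration, the sketch below builds the two node partitions with boolean masks; the variable names and split sizes are hypothetical, chosen only to mirror the notation X_L, X_U^obs, and X_U^ind.

```python
import torch

# Hypothetical node counts (Cora-like); the paper's actual splits may differ.
N = 2708
labeled = torch.zeros(N, dtype=torch.bool)
labeled[:140] = True                       # L: labeled training nodes
observed = torch.zeros(N, dtype=torch.bool)
observed[140:2208] = True                  # U_obs: unlabeled nodes visible at training time
held_out = ~(labeled | observed)           # U_ind: nodes hidden until inference

# Transductive: train on all features X and the labels Y_L; test on (X_U, Y_U).
transductive_train = torch.arange(N)
transductive_test = torch.nonzero(~labeled).squeeze(1)

# Inductive: train on X_L ∪ X_U^obs and Y_L; test on (X_U^ind, Y_U^ind).
inductive_train = torch.nonzero(labeled | observed).squeeze(1)
inductive_test = torch.nonzero(held_out).squeeze(1)
```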
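The distillation temperature τ and loss weight α from the Experiment Setup row enter a standard GNN-to-MLP distillation objective: a weighted sum of the supervised cross-entropy and a temperature-scaled KL term against the teacher's soft labels. The sketch below shows only this generic form; KRD's contribution, reweighting distillation targets by the estimated reliability of the teacher's knowledge, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, labeled_mask,
                      tau=1.0, alpha=0.3):
    """Generic GNN-to-MLP distillation: weighted CE + temperature-scaled KL.
    A common formulation assumed for illustration, not KRD's exact objective."""
    # Supervised term on labeled nodes only.
    ce = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask])
    # Soft-label term on all nodes, with the usual tau^2 gradient rescaling.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * ce + (1.0 - alpha) * kl
```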
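Finally, the dataset-specific hyperparameters are searched with the NNI toolkit. A minimal sketch of the quoted grids in NNI's standard `choice` search-space format follows; the key names are assumptions, since the authors' actual NNI configuration is not shown in the paper.

```python
# NNI search space for the quoted hyperparameter grids ("choice" is NNI's
# built-in categorical sampler). Key names are illustrative, not the authors'.
search_space = {
    "hidden_dim":   {"_type": "choice", "_value": [128, 256, 512, 1024, 2048]},
    "num_layers":   {"_type": "choice", "_value": [2, 3]},
    "tau":          {"_type": "choice", "_value": [0.8, 0.9, 1.0, 1.1, 1.2]},
    "alpha":        {"_type": "choice", "_value": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]},
    "lr":           {"_type": "choice", "_value": [0.001, 0.005, 0.01]},
    "weight_decay": {"_type": "choice", "_value": [0.0, 0.0005, 0.001]},
}
```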