Training Graph Transformers via Curriculum-Enhanced Attention Distillation

Authors: Yisong Huang, Jin Li, Xinlong Chen, Yang-Geng Fu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our method outperforms many state-of-the-art methods on seven public graph benchmarks, proving its effectiveness. We validate the effectiveness of our proposed method on seven graph benchmark datasets. Our method consistently outperforms existing GTs and GNNs, demonstrating enhanced performance and improved generalization capability.
Researcher Affiliation | Academia | Yisong Huang (1), Jin Li (1,2), Xinlong Chen (1) & Yang-Geng Fu (1); (1) College of Computer and Data Science, Fuzhou University, Fuzhou, China; (2) AI Thrust, Information Hub, HKUST (Guangzhou), Guangzhou, China
Pseudocode | No | The paper contains mathematical equations but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | Datasets. We evaluate our method on seven benchmark datasets, including citation network datasets Cora, Citeseer, and Pubmed (Sen et al., 2008); the Actor co-occurrence network dataset (Chien et al., 2021); and the WebKB datasets (Pei et al., 2020) including Cornell, Texas, and Wisconsin.
Dataset Splits | Yes | We apply the standard splits for the citation network datasets, as in the previous work (Kipf & Welling, 2017). For the remaining datasets, we set the train-validation-test split as 48%/32%/20%.
Hardware Specification | Yes | All experiments are conducted on one GeForce RTX 4090 GPU.
Software Dependencies | No | The paper mentions 'Python and PyTorch and use Adam as the optimizer' but does not specify version numbers for these software components.
Experiment Setup | Yes | We maintain fixed values for certain hyperparameters: the pre-training epochs of the teacher model are set to 200, the training epochs of the student model to 500, and the weight decay to 5e-4. We conduct hyperparameter tuning for other important hyperparameters on each dataset using grid search. The hyperparameter ranges are presented in Table 6. We provide the specific configurations of hyperparameters on each dataset in Table 7.
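
For context, the fixed parts of the reported setup (48%/32%/20% node splits for the non-citation datasets, Adam with weight decay 5e-4, 200 teacher pre-training epochs, 500 student training epochs) can be summarized in a short PyTorch-style sketch. This is a reconstruction under assumptions, not the authors' code (none is released); the helper names `random_split` and `make_optimizer` are hypothetical, and the distillation objective itself is not reproduced here.

```python
# Sketch of the reported experimental configuration only; the teacher/student
# models and the curriculum-enhanced attention distillation loss are NOT
# reproduced here (the paper releases no code).
import torch

def random_split(num_nodes, train_frac=0.48, val_frac=0.32, seed=0):
    # 48% / 32% / 20% train/val/test node split used for the non-citation datasets.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

# Hyperparameters reported as fixed; the rest are grid-searched per dataset
# (Tables 6 and 7 of the paper).
TEACHER_PRETRAIN_EPOCHS = 200
STUDENT_TRAIN_EPOCHS = 500
WEIGHT_DECAY = 5e-4

def make_optimizer(model, lr):
    # Adam with weight decay 5e-4, as stated; the learning rate is grid-searched.
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY)

# Example usage with a placeholder model and an arbitrary graph size.
train_idx, val_idx, test_idx = random_split(num_nodes=1000)
optimizer = make_optimizer(torch.nn.Linear(16, 5), lr=0.01)
```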