Do Transformers Really Perform Badly for Graph Representation?
Authors: Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first conduct experiments on the recent OGB-LSC [21] quantum chemistry regression (i.e., PCQM4M-LSC) challenge, which is currently the biggest graph-level prediction dataset and contains more than 3.8M graphs in total. Then, we report the results on the other three popular tasks: ogbg-molhiv, ogbg-molpcba and ZINC, which come from the OGB [22] and benchmarking-GNN [14] leaderboards. |
| Researcher Affiliation | Collaboration | ¹Dalian University of Technology, ²Princeton University, ³Peking University, ⁴Microsoft Research Asia |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. Equations (8) and (9) describe the Graphormer layer mathematically but are not presented as pseudocode (a minimal sketch of such a layer is given after the table). |
| Open Source Code | Yes | The code and models of Graphormer will be made publicly available at https://github.com/Microsoft/Graphormer. |
| Open Datasets | Yes | We first conduct experiments on the recent OGB-LSC [21] quantum chemistry regression (i.e., PCQM4M-LSC) challenge, which is currently the biggest graph-level prediction dataset and contains more than 3.8M graphs in total. Then, we report the results on the other three popular tasks: ogbg-molhiv, ogbg-molpcba and ZINC, which come from the OGB [22] and benchmarking-GNN [14] leaderboards. |
| Dataset Splits | No | A detailed description of datasets and training strategies could be found in Appendix B. The main text mentions 'validate MAE' in tables, but does not provide specific split percentages or sample counts for the validation set within the provided text. |
| Hardware Specification | Yes | All models are trained on 8 NVIDIA V100 GPUs for about 2 days. |
| Software Dependencies | No | The paper mentions 'AdamW as the optimizer' but does not provide specific software library names with version numbers (e.g., PyTorch, TensorFlow, or scikit-learn versions) required for reproduction. |
| Experiment Setup | Yes | We primarily report results on two model sizes: Graphormer (L = 12, d = 768), and a smaller one Graphormer SMALL (L = 6, d = 512). Both the number of attention heads in the attention module and the dimensionality of edge features d_E are set to 32. We use AdamW as the optimizer, and set the hyper-parameter ϵ to 1e-8 and (β1, β2) to (0.99, 0.999). The peak learning rate is set to 2e-4 (3e-4 for Graphormer SMALL) with a 60k-step warm-up stage followed by a linear decay learning rate scheduler. The total training steps are 1M. The batch size is set to 1024. (See the optimizer/scheduler sketch after the table.) |
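
Since the paper reports no pseudocode, the following is a minimal PyTorch sketch of a pre-LayerNorm Transformer block in the spirit of Equations (8) and (9), i.e. `h' = MHA(LN(h)) + h` followed by `h = FFN(LN(h')) + h'`. Graphormer's structural encodings (centrality, spatial, and edge encodings) are abstracted into an optional additive attention bias; the `PreLNTransformerLayer` class, its interface, and the FFN width are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a pre-LayerNorm Transformer block matching the form of
# Eq. (8)-(9): h' = MHA(LN(h)) + h;  h = FFN(LN(h')) + h'.
# Graphormer's structural encodings are abstracted into an optional additive
# attention bias -- an assumption about the interface, not the released code.
from typing import Optional

import torch
import torch.nn as nn


class PreLNTransformerLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 32, d_ff: int = 768):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # d_ff is an assumption; the quoted setup only specifies d and n_heads.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h: torch.Tensor, attn_bias: Optional[torch.Tensor] = None):
        # h: (batch, num_nodes, d_model)
        # attn_bias: float mask added to attention logits, or None
        x = self.ln1(h)
        x, _ = self.attn(x, x, x, attn_mask=attn_bias, need_weights=False)
        h = h + x                      # residual around attention, Eq. (8)
        h = h + self.ffn(self.ln2(h))  # residual around FFN, Eq. (9)
        return h
```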
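
The reported experiment setup also maps onto a standard optimizer/scheduler configuration. The sketch below wires AdamW with the stated hyper-parameters (eps = 1e-8, betas = (0.99, 0.999), peak learning rate 2e-4) to a 60k-step linear warm-up followed by linear decay over 1M total steps. The `lr_lambda` helper, the placeholder model, and the decay-to-zero endpoint are assumptions; the paper only states a linear decay scheduler after the warm-up.

```python
# Hedged sketch of the reported optimization setup: AdamW with eps=1e-8 and
# betas=(0.99, 0.999), peak LR 2e-4, 60k-step linear warm-up, then linear
# decay over 1M total steps. Batch size in the paper's setup is 1024.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 2e-4, 60_000, 1_000_000

# Placeholder module standing in for the actual Graphormer model (assumption).
model = torch.nn.Linear(768, 768)
optimizer = AdamW(model.parameters(), lr=PEAK_LR, betas=(0.99, 0.999), eps=1e-8)


def lr_lambda(step: int) -> float:
    # Linear warm-up to the peak LR, then linear decay; decay to zero at the
    # final step is an assumption, since the paper only says "linear decay".
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))


scheduler = LambdaLR(optimizer, lr_lambda)

# Per training step: loss.backward(); optimizer.step(); scheduler.step();
# optimizer.zero_grad()
```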