Pure Transformers are Powerful Graph Learners
Authors: Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, Seunghoon Hong
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first conduct a synthetic experiment that directly confirms our key claims in Lemma 1 (Section 3). Then, we empirically explore the capability of Tokenized Graph Transformer (Token GT) (Section 2) using the PCQM4Mv2 large-scale quantum chemistry regression dataset [27]. In an experiment with the PCQM4Mv2 large-scale dataset, we show that Token GT performs significantly better than all GNN baselines and is competitive with Transformer variants with strong graph-specific architectural components [78, 29, 54]. |
| Researcher Affiliation | Collaboration | KAIST, LG AI Research, University of Illinois Chicago |
| Pseudocode | No | The paper describes the architecture and components of Token GT, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/jw9730/tokengt. |
| Open Datasets | Yes | We test our model, named Tokenized Graph Transformer (Token GT), mainly on the PCQM4Mv2 large-scale quantum chemical property prediction dataset containing 3.7M molecular graphs [27]. |
| Dataset Splits | Yes | We report the Mean Absolute Error (MAE) on the validation set, and report MAE on the hidden test set when possible. |
| Hardware Specification | Yes | We train the models on 8 RTX 3090 GPUs for 3 days. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any key software components such as deep learning frameworks (e.g., PyTorch, TensorFlow), Python, or CUDA. |
| Experiment Setup | Yes | For Token GT, we use both node and type identifiers, and use a main Transformer encoder configuration based on Graphormer [78] with 12 layers, 768 hidden dimensions, and 32 attention heads. We use the AdamW optimizer with (β1, β2) = (0.99, 0.999) and weight decay 0.1, and 60k learning rate warmup steps followed by linear decay over 1M iterations with batch size 1024. For fine-tuning, we use 1k warmup, 0.1M training steps, and cosine learning rate decay. (Illustrative sketches of the token construction and training configuration follow this table.) |
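
Since the paper itself contains no pseudocode (see the Pseudocode row above), here is a minimal NumPy sketch of the token construction that the Experiment Setup row refers to: nodes and edges become independent tokens, each concatenated with orthonormal node identifiers and a learned type identifier. The function names, identifier dimensions, and the random-orthonormal construction below are illustrative assumptions, not the authors' released implementation (see https://github.com/jw9730/tokengt for that).

```python
import numpy as np

def orthonormal_node_identifiers(n_nodes, dim, rng):
    """Random orthonormal node identifiers (orthogonal random features are one
    option; Laplacian eigenvectors are another). Assumes dim >= n_nodes."""
    gaussian = rng.standard_normal((dim, n_nodes))
    q, _ = np.linalg.qr(gaussian)       # q: (dim, n_nodes), orthonormal columns
    return q.T                          # (n_nodes, dim), orthonormal rows

def build_tokens(node_feats, edge_index, edge_feats, node_ids, type_ids):
    """Flatten a graph into a token sequence for a plain Transformer.
    node_feats: (n, d) node features; edge_index: list of (u, v) pairs;
    edge_feats: (m, d) edge features; node_ids: (n, d_p) orthonormal identifiers;
    type_ids: (2, d_t) learned [node, edge] type identifiers.
    Each node token is [x_v, P_v, P_v, E_node]; each edge token is [x_uv, P_u, P_v, E_edge]."""
    node_tokens = [
        np.concatenate([node_feats[v], node_ids[v], node_ids[v], type_ids[0]])
        for v in range(node_feats.shape[0])
    ]
    edge_tokens = [
        np.concatenate([edge_feats[i], node_ids[u], node_ids[v], type_ids[1]])
        for i, (u, v) in enumerate(edge_index)
    ]
    return np.stack(node_tokens + edge_tokens)  # (n + m, d + 2*d_p + d_t)

# Tiny usage example on a 3-node path graph.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))               # node features
edges = [(0, 1), (1, 2)]
e = rng.standard_normal((len(edges), 4))      # edge features
p = orthonormal_node_identifiers(3, dim=8, rng=rng)
t = rng.standard_normal((2, 4))               # stand-in for trainable type embeddings
tokens = build_tokens(x, edges, e, p, t)
print(tokens.shape)                           # (5, 24)
```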
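
The optimizer and learning-rate schedule quoted in the Experiment Setup row (AdamW with (β1, β2) = (0.99, 0.999), weight decay 0.1, 60k warmup steps, linear decay over 1M steps, batch size 1024) can be sketched in PyTorch roughly as follows. The peak learning rate and the stand-in model are assumptions, since the excerpt does not quote them.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in module; the actual model is a 12-layer, 768-dim, 32-head Transformer encoder.
model = torch.nn.Linear(768, 1)

peak_lr = 2e-4           # assumption: the excerpt does not quote the peak learning rate
warmup_steps = 60_000    # "60k learning rate warmup steps"
total_steps = 1_000_000  # "linear decay over 1M iterations"

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.99, 0.999), weight_decay=0.1)

def warmup_then_linear_decay(step):
    # Linear warmup to the peak LR, then linear decay to zero at total_steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay)

# Per training step (batch size 1024 in the quoted setup):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```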