Pure Transformers are Powerful Graph Learners

Authors: Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, Seunghoon Hong

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first conduct a synthetic experiment that directly confirms our key claims in Lemma 1 (Section 3). Then, we empirically explore the capability of the Tokenized Graph Transformer (TokenGT) (Section 2) using the PCQM4Mv2 large-scale quantum chemistry regression dataset [27]. In experiments on the large-scale PCQM4Mv2 dataset, we show that TokenGT performs significantly better than all GNN baselines and is competitive with Transformer variants that have strong graph-specific architectural components [78, 29, 54].
Researcher Affiliation | Collaboration | ¹KAIST, ²LG AI Research, ³University of Illinois Chicago
Pseudocode | No | The paper describes the architecture and components of TokenGT, but it does not include any explicitly labeled pseudocode or algorithm blocks (a hedged tokenization sketch is provided after this table).
Open Source Code | Yes | Our implementation is available at https://github.com/jw9730/tokengt.
Open Datasets | Yes | We test our model, named Tokenized Graph Transformer (TokenGT), mainly on the PCQM4Mv2 large-scale quantum chemical property prediction dataset containing 3.7M molecular graphs [27] (a dataset-loading sketch is provided after this table).
Dataset Splits | Yes | We report the Mean Absolute Error (MAE) on the validation set, and report MAE on the unavailable test set where possible. For fine-tuning, we use 1k warmup steps, 0.1M training steps, and cosine learning rate decay.
Hardware Specification | Yes | We train the models on 8 RTX 3090 GPUs for 3 days.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any key software components, such as deep learning frameworks (e.g., PyTorch, TensorFlow), Python, or CUDA.
Experiment Setup | Yes | For TokenGT, we use both node and type identifiers, and a main Transformer encoder configuration based on Graphormer [78] with 12 layers, 768 hidden dimensions, and 32 attention heads. We use the AdamW optimizer with (β1, β2) = (0.99, 0.999) and weight decay 0.1, with 60k learning-rate warmup steps followed by linear decay over 1M iterations at batch size 1024. For fine-tuning, we use 1k warmup steps, 0.1M training steps, and cosine learning rate decay (an optimizer and schedule sketch is provided after this table).
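
Since the paper itself contains no pseudocode, the following is a minimal sketch of the TokenGT token construction summarized above: every node and every edge becomes one token built from its feature, orthonormal node identifier(s), and a type identifier, and the resulting sequence can be fed to a plain Transformer encoder. The function name, dimensions, and the use of random orthonormal identifiers (one of the options the paper discusses) are illustrative assumptions, not the authors' implementation.

```python
import torch

# Hedged sketch of TokenGT-style tokenization (not the authors' code).
# Each node and each edge becomes one token: [feature ; node identifier(s) ; type identifier].
# Node identifiers here are random orthonormal vectors; dimensions are illustrative.

def tokenize_graph(node_feat, edge_index, edge_feat, d_id=64):
    """node_feat: [n, d], edge_index: [2, m], edge_feat: [m, d] -> tokens [n + m, d + 2*d_id + 2]."""
    n, d = node_feat.shape
    m = edge_index.shape[1]

    # Orthonormal node identifiers via QR decomposition of a random matrix (assumes n <= d_id).
    q, _ = torch.linalg.qr(torch.randn(d_id, d_id))
    node_id = q[:n]

    # Node tokens: feature, identifier repeated twice, type identifier [1, 0].
    node_type = torch.tensor([1.0, 0.0]).expand(n, 2)
    node_tokens = torch.cat([node_feat, node_id, node_id, node_type], dim=-1)

    # Edge tokens: feature, identifiers of both endpoints, type identifier [0, 1].
    src, dst = edge_index
    edge_type = torch.tensor([0.0, 1.0]).expand(m, 2)
    edge_tokens = torch.cat([edge_feat, node_id[src], node_id[dst], edge_type], dim=-1)

    # The concatenated sequence is then consumed by a standard Transformer encoder.
    return torch.cat([node_tokens, edge_tokens], dim=0)


# Toy usage: a 3-node path graph with 2 undirected edges stored once per direction.
x = torch.randn(3, 16)
ei = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
ea = torch.randn(4, 16)
print(tokenize_graph(x, ei, ea).shape)  # torch.Size([7, 146])
```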
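The PCQM4Mv2 dataset quoted in the Open Datasets row is distributed through OGB-LSC. A hedged loading sketch, assuming the PyG-backed loader from the ogb package, might look like the following; the split names and evaluator follow the OGB release, while the root path is arbitrary.

```python
# Hedged sketch of loading PCQM4Mv2 with the OGB-LSC package (pip install ogb).
# The paper reports validation MAE, since test labels are withheld by the benchmark.
from ogb.lsc import PygPCQM4Mv2Dataset, PCQM4Mv2Evaluator

dataset = PygPCQM4Mv2Dataset(root="data/")       # ~3.7M molecular graphs
split = dataset.get_idx_split()                  # 'train', 'valid', 'test-dev', 'test-challenge'

train_set = dataset[split["train"]]
valid_set = dataset[split["valid"]]

# Validation MAE via the official evaluator.
evaluator = PCQM4Mv2Evaluator()
# result = evaluator.eval({"y_pred": y_pred, "y_true": y_true})  # returns {"mae": ...}
```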
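The pre-training hyperparameters in the Experiment Setup row translate into a short PyTorch configuration. The sketch below assumes a placeholder model and a peak learning rate of 2e-4 (not stated in the quoted text); only the betas, weight decay, warmup length, decay horizon, and batch size come from the row above, and the authors' released code may implement the schedule differently.

```python
import torch

# Hedged sketch of the quoted optimization setup, with per-step scheduler updates.
model = torch.nn.Linear(768, 1)          # placeholder for the TokenGT encoder + prediction head

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4,         # peak LR is an assumption; not stated in this excerpt
    betas=(0.99, 0.999), weight_decay=0.1,
)

warmup_steps, total_steps = 60_000, 1_000_000  # 60k warmup, linear decay over 1M iterations

def lr_lambda(step):
    # Linear warmup to the peak LR, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop skeleton (batch size 1024 in the paper; MAE objective on PCQM4Mv2):
# for step, batch in enumerate(loader):
#     loss = torch.nn.functional.l1_loss(model(batch.x), batch.y)
#     loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```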