Understanding Transformer Reasoning Capabilities via Graph Algorithms

Authors: Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, Vahab Mirrokni

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.
Researcher Affiliation | Collaboration | 1Google Research, 2Columbia University, 3Google, 4Google DeepMind
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | While code is not made available, the data can be generated using the GraphQA dataset linked in the appendix. Full training details are made available, which makes the code easily reproducible. The code will be open sourced upon acceptance of the paper.
Open Datasets | Yes | We use the GraphQA benchmark tasks [24] for our experiments. We used the public code of the dataset available at https://github.com/google-research/google-research/tree/master/graphqa.
Dataset Splits | Yes | There are 1,000 graphs in the original train set, 500 graphs in the dev set, and 500 graphs in the test set. ... Optimal hyperparameters for each task and model were determined by training on the GraphQA train dataset and evaluating performance on the GraphQA dev dataset.
Hardware Specification | Yes | All experiments were conducted using Google's TPUv3 and TPUv5e accelerators [35].
Software Dependencies | No | We implemented our model in JAX [26] and used AdamW [43, 50] as the optimizer. ... We employed the GLU [73] activation as a non-linearity. The paper does not specify version numbers for JAX, AdamW, or GLU.
Experiment Setup | Yes | We fixed the number of iterations as 1,000,000 and train standard decoder-only transformers with L = 12 layers, m = 768 embedding dimension, H = 12 heads, learning rate 5 × 10^-4, and dropout 0.1.
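The Open Datasets and Dataset Splits rows describe text-encoded graph questions and a 1,000/500/500 train/dev/test split. The following is a minimal sketch of what one GraphQA-style edge-existence example and that split could look like; the text encoding and helper names here are illustrative assumptions, not the released graphqa code linked above.

```python
# Sketch of a GraphQA-style example and a 1,000/500/500 split.
# The encoding convention and helper names are assumptions for illustration.
import random
import networkx as nx

def encode_graph_as_text(graph: nx.Graph) -> str:
    """Encode a graph as a plain-text edge list (one possible text encoding)."""
    edges = ", ".join(f"({u}, {v})" for u, v in graph.edges())
    return (f"G describes a graph among nodes 0 to {graph.number_of_nodes() - 1}. "
            f"The edges in G are: {edges}.")

def make_edge_existence_example(rng: random.Random) -> dict:
    """Build one edge-existence question over a random Erdos-Renyi graph."""
    n = rng.randint(5, 20)
    graph = nx.gnp_random_graph(n, p=0.3, seed=rng.randint(0, 10**9))
    u, v = rng.sample(range(n), 2)
    question = f"Is there an edge between node {u} and node {v}?"
    answer = "Yes" if graph.has_edge(u, v) else "No"
    return {"prompt": encode_graph_as_text(graph) + " " + question, "answer": answer}

rng = random.Random(0)
examples = [make_edge_existence_example(rng) for _ in range(2000)]
train, dev, test = examples[:1000], examples[1000:1500], examples[1500:2000]
```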
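For the reported training setup (JAX, AdamW, GLU activation, L = 12 layers, m = 768, H = 12 heads, learning rate 5 × 10^-4, dropout 0.1, 1,000,000 iterations), a minimal configuration sketch might look as follows, assuming Flax and Optax on top of JAX. The module names, feed-forward expansion factor, and weight-decay value are assumptions, not the authors' code.

```python
# Sketch of the reported hyperparameters and a GLU feed-forward block.
# Hyperparameter values come from the paper; everything else is illustrative.
import dataclasses
import jax
import flax.linen as nn
import optax

@dataclasses.dataclass
class Config:
    num_layers: int = 12        # L = 12 decoder layers
    embed_dim: int = 768        # m = 768 embedding dimension
    num_heads: int = 12         # H = 12 attention heads
    dropout_rate: float = 0.1
    learning_rate: float = 5e-4
    num_steps: int = 1_000_000

class GLUFeedForward(nn.Module):
    """Feed-forward block using a GLU non-linearity, as mentioned in the paper."""
    embed_dim: int
    dropout_rate: float

    @nn.compact
    def __call__(self, x, deterministic: bool = True):
        # Project to 2x the hidden width so jax.nn.glu can split the result
        # into value and gate halves; the 4x expansion factor is an assumption.
        h = nn.Dense(2 * 4 * self.embed_dim)(x)
        h = jax.nn.glu(h, axis=-1)
        h = nn.Dropout(self.dropout_rate)(h, deterministic=deterministic)
        return nn.Dense(self.embed_dim)(h)

config = Config()
# AdamW as reported in the paper; the weight-decay value is an assumption.
optimizer = optax.adamw(learning_rate=config.learning_rate, weight_decay=0.01)
```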