Understanding Transformer Reasoning Capabilities via Graph Algorithms
Authors: Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, Vahab Mirrokni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks. |
| Researcher Affiliation | Collaboration | 1 Google Research, 2 Columbia University, 3 Google, 4 Google DeepMind |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | While code is not made available, the data can be generated using the GraphQA dataset linked in the appendix. Full training details are made available, which makes the code easily reproducible. The code will be open sourced upon acceptance of the paper. |
| Open Datasets | Yes | We use the GraphQA benchmark tasks [24] for our experiments. We used the public code of the dataset available at https://github.com/google-research/google-research/tree/master/graphqa. |
| Dataset Splits | Yes | There are 1,000 graphs in the original train set, 500 graphs in the dev set, and 500 graphs in the test set. ... Optimal hyperparameters for each task and model were determined by training on the GraphQA Train dataset and evaluating performance on the GraphQA Dev dataset. |
| Hardware Specification | Yes | All experiments were conducted using Google's TPUv3 and TPUv5e accelerators [35]. |
| Software Dependencies | No | We implemented our model in JAX [26] and used AdamW [43, 50] as the optimizer. ... We employed the GLU [73] activation as a non-linearity. The paper does not specify version numbers for JAX or the other software dependencies. |
| Experiment Setup | Yes | We fixed the number of iterations as 1,000,000 and trained standard decoder-only transformers with L = 12 layers, m = 768 embedding dimension, H = 12 heads, learning rate 5 × 10⁻⁴, and dropout 0.1. |
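
The software and experiment-setup rows above pin down the optimizer, model size, and iteration count, but not the surrounding training code. The following is a minimal JAX/Optax sketch that wires the quoted values (AdamW, learning rate 5 × 10⁻⁴, dropout 0.1, 1,000,000 iterations, 1,000/500/500 GraphQA splits) into a generic training loop; the model parameters, loss function, data iterator, and any weight-decay or learning-rate schedule are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch only: the hyperparameters are those quoted in the table above;
# the model, loss, and data pipeline are hypothetical placeholders.
import jax
import optax

# Quoted experiment setup.
NUM_LAYERS = 12            # L = 12 decoder layers
EMBED_DIM = 768            # m = 768 embedding dimension
NUM_HEADS = 12             # H = 12 attention heads
LEARNING_RATE = 5e-4       # learning rate 5 x 10^-4
DROPOUT_RATE = 0.1         # dropout 0.1
NUM_ITERATIONS = 1_000_000

# Quoted GraphQA split sizes (number of graphs per split).
TRAIN_GRAPHS, DEV_GRAPHS, TEST_GRAPHS = 1_000, 500, 500

# AdamW as stated in the paper; weight decay and schedule are not specified,
# so Optax defaults are used here as an assumption.
optimizer = optax.adamw(learning_rate=LEARNING_RATE)


def train(params, loss_fn, data_iterator):
    """Generic AdamW loop; `params`, `loss_fn`, and `data_iterator` are placeholders."""
    opt_state = optimizer.init(params)

    @jax.jit
    def step(params, opt_state, batch):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        updates, opt_state = optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
        return params, opt_state, loss

    for _ in range(NUM_ITERATIONS):
        params, opt_state, loss = step(params, opt_state, next(data_iterator))
    return params
```

Per the dataset-splits row, hyperparameter selection would then train on the GraphQA Train split and pick the best configuration by performance on the GraphQA Dev split before reporting on the test split.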