How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad
Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of inductive scratchpad, that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. |
| Researcher Affiliation | Collaboration | Emmanuel Abbe (Apple, EPFL), Samy Bengio (Apple), Aryo Lotfi (EPFL), Colin Sandon (EPFL), Omid Saremi (Apple) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our code is available at https://github.com/aryol/inductive-scratchpad. |
| Open Datasets | Yes | The paper focuses on artificial data that can easily be reproduced; details of the exact models used are provided in the appendix. Further, our code is publicly available at https://github.com/aryol/inductive-scratchpad. |
| Dataset Splits | No | The validation set has the same distribution as the training set, showing that the model reaches around 80% accuracy on in-distribution samples (Figure 7 caption). However, specific percentages or counts for the split are not provided. |
| Hardware Specification | Yes | We used different Nvidia GPU devices for running our experiments including H100, A100, and RTX4090. |
| Software Dependencies | No | Our implementation uses the PyTorch framework [97] and is mostly built on nanoGPT's implementation [98]. The paper mentions software names but not specific version numbers. |
| Experiment Setup | Yes | Our Transformers use causal attention masking and absolute learnable positional embeddings. For most experiments, we use a small model with 6 layers, 6 heads, and an embedding dimension of 384 which results in a model with approximately 10M parameters. We only change the size of the model in Figure 1b where we use models with 8 layers, 8 heads, and an embedding dimension of 512 (approximately 25M parameters), and 12 layers, 12 heads, and an embedding dimension of 768 (roughly 85M parameters). ...We train our base decoder-only model for 2000 iterations... |
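
For reference, the architecture details quoted in the Experiment Setup row can be summarized as a nanoGPT-style configuration. The sketch below is illustrative only, not the authors' released code: the field names, context length, and vocabulary size are assumptions not specified in the quotes above, while the layer/head/embedding values and the rough parameter counts follow the table.

```python
# Minimal sketch (assumed nanoGPT-style naming) of the decoder-only
# Transformer configurations quoted in the Experiment Setup row.
from dataclasses import dataclass


@dataclass
class GPTConfig:
    n_layer: int = 6        # base model: 6 layers (~10M parameters)
    n_head: int = 6         # 6 attention heads
    n_embd: int = 384       # embedding dimension 384
    block_size: int = 256   # context length: placeholder, not given in the quotes
    vocab_size: int = 64    # task-dependent: placeholder, not given in the quotes
    causal: bool = True     # causal attention masking
    learned_pos_emb: bool = True  # absolute learnable positional embeddings


def approx_params(cfg: GPTConfig) -> int:
    # Standard rough estimate for a GPT block stack (12 * d_model^2 per layer),
    # ignoring embeddings and layer norms. Matches the quoted figures:
    # 6x384 -> ~10.6M, 8x512 -> ~25.2M, 12x768 -> ~84.9M.
    return 12 * cfg.n_layer * cfg.n_embd ** 2


base = GPTConfig()                                      # ~10M parameters
medium = GPTConfig(n_layer=8, n_head=8, n_embd=512)     # ~25M parameters (Figure 1b)
large = GPTConfig(n_layer=12, n_head=12, n_embd=768)    # ~85M parameters (Figure 1b)
```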