How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of inductive scratchpad, that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization." (An illustrative scratchpad-format sketch is given after the table.)
Researcher Affiliation | Collaboration | Emmanuel Abbe (Apple, EPFL), Samy Bengio (Apple), Aryo Lotfi (EPFL), Colin Sandon (EPFL), Omid Saremi (Apple)
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | "Our code is available at https://github.com/aryol/inductive-scratchpad."
Open Datasets | Yes | "The paper focuses on artificial data that can easily be reproduced, details of the exact models used are provided in the appendix. Further, our code is publicly available at https://github.com/aryol/inductive-scratchpad."
Dataset Splits | No | "The validation set has the same distribution as the training set showing that the model reaches around 80% accuracy on in-distribution samples." (Figure 7 caption). However, specific percentages or counts for the split are not provided.
Hardware Specification | Yes | "We used different Nvidia GPU devices for running our experiments including H100, A100, and RTX4090."
Software Dependencies | No | "Our implementation uses the PyTorch framework [97] and is mostly built on NanoGPT's implementation [98]." The paper mentions software names but not specific version numbers.
Experiment Setup | Yes | "Our Transformers use causal attention masking and absolute learnable positional embeddings. For most experiments, we use a small model with 6 layers, 6 heads, and an embedding dimension of 384 which results in a model with approximately 10M parameters. We only change the size of the model in Figure 1b where we use models with 8 layers, 8 heads, and an embedding dimension of 512 (approximately 25M parameters), and 12 layers, 12 heads, and an embedding dimension of 768 (roughly 85M parameters). ... We train our base decoder-only model for 2000 iterations ..." (A minimal configuration sketch is given after the table.)
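
To illustrate the inductive scratchpad idea quoted in the Research Type row, the following is a minimal sketch, not taken from the paper: it generates training strings for a toy running-sum task in which the answer is decomposed into intermediate states, each computable from the question and the previous state alone. The task, the "s=" state notation, and the "#" separator are assumptions made here for illustration; the paper's actual tasks and token format differ.

```python
# Illustrative sketch only: data generator in the spirit of an "inductive scratchpad".
# The target is decomposed into intermediate states, where each state depends only on
# the question and the previous state. Task and formatting are assumptions, not the
# paper's benchmarks.
import random

def inductive_scratchpad_sample(n_terms=4, max_val=9):
    xs = [random.randint(0, max_val) for _ in range(n_terms)]
    question = " + ".join(map(str, xs)) + " ="
    states, total = [], 0
    for x in xs:
        total += x                      # next state = previous state + one term from the question
        states.append(f"s={total}")
    # question, then the chain of states, then the final answer
    return question + " # " + " # ".join(states) + f" # answer: {total}"

print(inductive_scratchpad_sample())
# e.g. "3 + 7 + 2 + 5 = # s=3 # s=10 # s=12 # s=17 # answer: 17"
```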
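
The base model described in the Experiment Setup row (decoder-only, causal attention masking, absolute learnable positional embeddings, 6 layers, 6 heads, embedding dimension 384, roughly 10M parameters) can be sketched as follows. This is a minimal PyTorch sketch, not the authors' NanoGPT-based implementation; the vocabulary size and block size below are placeholder assumptions.

```python
# Minimal sketch of the reported base configuration (not the authors' code):
# 6 layers, 6 heads, embedding dim 384, causal masking, learnable absolute positions.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                                 nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, block_size=256, n_layer=6, n_head=6, n_embd=384):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)    # token embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)    # learnable absolute positions
        self.blocks = nn.ModuleList(Block(n_embd, n_head) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                     # next-token logits

model = TinyDecoder(vocab_size=64)
print(sum(p.numel() for p in model.parameters()))          # roughly 10M with n_embd=384
```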