What Algorithms can Transformers Learn? A Study in Length Generalization
Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Joshua M. Susskind, Samy Bengio, Preetum Nakkiran
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization... We present empirical evidence to support the correlation between RASP-simplicity and generalization. We leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition)... Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers. |
| Researcher Affiliation | Collaboration | Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran (Apple; Mila, Université de Montréal; Tel Aviv University) |
| Pseudocode | Yes | Listing 1: RASP-L core functions. Listing 2: RASP-L library functions. Listing 3: RASP-L program for Count. Listing 4: RASP-L program for Mode. Listing 5: RASP-L program for Copy with unique tokens. Listing 6: An addition program that is illegal in RASP-L for several reasons. Listing 7: RASP-L program for addition, with output in reverse order and index-hints. Listing 8: The required patch to Listing 7, to produce a program for addition in forward order. (An illustrative RASP-L-style sketch in Python follows this table.) |
| Open Source Code | No | No explicit statement about the release of their own source code for the described methodology was found. The paper only mentions 'nanogpt' as a tool used: 'Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2023.' |
| Open Datasets | No | No specific link, DOI, repository name, or formal citation for a publicly available or open dataset was provided. The paper describes generating synthetic data for each task: 'For all tasks, the length of training examples is sampled uniformly from length 1 up to the max training length.' (A minimal sketch of this sampling recipe appears after the table.) |
| Dataset Splits | No | No specific information regarding a 'validation' dataset split was found. The paper describes training on a 'train distribution' and evaluating on 'test' sets: 'We train all of our models to convergence on the train distribution where possible... At test time, we evaluate examples without packing and shifting.' |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided. The paper only refers to 'Transformers' and 'decoder-only Transformers'. |
| Software Dependencies | No | No specific software dependencies with version numbers were listed. The paper mentions the 'AdamW optimizer and cosine learning rate schedule' in Table 1 and 'numpy' in Appendix F.1, but without associated version numbers. |
| Experiment Setup | Yes | Table 1: Experimental hyperparameters. All experiments use the AdamW optimizer and a cosine learning rate schedule. Count and Copy use weight decay of 0.1 and grad clip of 0. Parity and Mode use weight decay of 0.1 and grad clip of 1. Addition uses weight decay of 0 and grad clip of 1. (A hedged training-loop sketch of this setup follows below.) |
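
To give a flavor of the pseudocode listings cited in the Pseudocode row, here is a minimal Python sketch of a Count-style program. The helper names `select` and `aggregate_count`, the causal-masking convention, and the toy input are illustrative assumptions, not the paper's actual RASP-L core library (Listing 1); they only mimic how such a program counts matching earlier tokens using attention-like operations.

```python
# Illustrative sketch only: `select` and `aggregate_count` are hypothetical
# stand-ins for RASP-L-style core functions, not the paper's library.
import numpy as np

def select(keys, queries, predicate):
    # Boolean attention matrix A[q, k] = predicate(keys[k], queries[q]),
    # causally masked so each position attends only to itself and earlier keys.
    n = len(keys)
    A = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):  # causal mask
            A[q, k] = predicate(keys[k], queries[q])
    return A

def aggregate_count(A):
    # Number of selected (attended-to) positions per query position.
    return A.sum(axis=1)

def count_program(tokens):
    # Count-style task: at each position, count how many positions up to and
    # including the current one hold the same token.
    x = np.array(tokens)
    A = select(x, x, lambda k, q: k == q)
    return aggregate_count(A)

print(count_program([3, 1, 3, 3, 2]))  # -> [1 1 2 3 1]
```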
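
The synthetic-data description quoted under Open Datasets can be illustrated with a short sketch. Only the uniform sampling of example lengths from 1 up to the max training length comes from the paper; the copy task, vocabulary size, and maximum length below are assumptions for illustration.

```python
# Hedged sketch of the described data recipe: per-example lengths drawn
# uniformly from 1 to the max training length. Task and vocab are assumed.
import numpy as np

rng = np.random.default_rng(0)

def sample_copy_example(max_train_len=20, vocab_size=16):
    length = rng.integers(1, max_train_len + 1)   # uniform over [1, max_train_len]
    tokens = rng.integers(0, vocab_size, size=length)
    return tokens.tolist(), tokens.tolist()       # (input, target) for a copy task

x, y = sample_copy_example()
```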
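
The Experiment Setup row reports AdamW with a cosine learning rate schedule, per-task weight decay, and gradient clipping. A minimal PyTorch sketch of that optimization loop follows; the model, learning rate, batch size, and step count are placeholders, not the hyperparameters from the paper's Table 1.

```python
# Hedged sketch of the reported setup: AdamW + cosine LR schedule with
# optional gradient clipping. All concrete values are placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(64, 64)  # stand-in for the decoder-only Transformer
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight decay 0 or 0.1 per task
scheduler = CosineAnnealingLR(optimizer, T_max=1_000)             # cosine schedule over training steps

max_grad_norm = 1.0  # paper reports grad clip of 0 (off) or 1, depending on task

for step in range(1_000):
    loss = model(torch.randn(8, 64)).pow(2).mean()  # dummy loss in place of next-token loss
    optimizer.zero_grad()
    loss.backward()
    if max_grad_norm > 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
```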