What Algorithms can Transformers Learn? A Study in Length Generalization
Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Joshua M. Susskind, Samy Bengio, Preetum Nakkiran
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization... We present empirical evidence to support the correlation between RASP-simplicity and generalization. We leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition)... Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers. |
| Researcher Affiliation | Collaboration | Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran (Apple; Mila, Université de Montréal; Tel Aviv University) |
| Pseudocode | Yes | Listing 1: RASP-L core functions. Listing 2: RASP-L library functions. Listing 3: RASP-L program for Count. Listing 4: RASP-L program for Mode. Listing 5: RASP-L program for Copy with unique tokens. Listing 6: An addition program that is illegal in RASP-L for several reasons. Listing 7: RASP-L program for addition, with output in reverse order and index-hints. Listing 8: The required patch to Listing 7, to produce a program for addition in forward order. (An illustrative RASP-L-style sketch in Python follows this table.) |
| Open Source Code | No | No explicit statement about the release of their own source code for the described methodology was found. The paper only mentions 'nanogpt' as a tool used: 'Andrej Karpathy. nanogpt. https://github.com/karpathy/nanoGPT, 2023.' |
| Open Datasets | No | No specific link, DOI, repository name, or formal citation for a publicly available or open dataset was provided. The paper describes generating synthetic data for each task: 'For all tasks, the length of training examples is sampled uniformly from length 1 up to the max training length.' (A minimal sketch of this sampling recipe appears after the table.) |
| Dataset Splits | No | No specific information regarding a 'validation' dataset split was found. The paper describes training on a 'train distribution' and evaluating on 'test' sets: 'We train all of our models to convergence on the train distribution where possible... At test time, we evaluate examples without packing and shifting.' |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided. The paper only refers to 'Transformers' and 'decoder-only Transformers'. |
| Software Dependencies | No | No specific software dependencies with version numbers were listed. The paper mentions the 'AdamW optimizer and cosine learning rate schedule' in Table 1 and 'numpy' in Appendix F.1, but without associated version numbers. |
| Experiment Setup | Yes | Table 1: Experimental hyperparameters. All experiments use the AdamW optimizer and a cosine learning rate schedule. Count and Copy use weight decay of 0.1 and grad clip of 0. Parity and Mode use weight decay of 0.1 and grad clip of 1. Addition uses weight decay of 0 and grad clip of 1. (A hedged training-loop sketch of this setup follows below.) |
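
To give a flavor of the pseudocode listings cited in the Pseudocode row, here is a minimal Python sketch of a Count-style program. The helper names `select` and `aggregate_count`, the causal-masking convention, and the toy input are illustrative assumptions, not the paper's actual RASP-L core library (Listing 1); they only mimic how such a program counts matching earlier tokens using attention-like operations.

```python
# Illustrative sketch only: `select` and `aggregate_count` are hypothetical
# stand-ins for RASP-L-style core functions, not the paper's library.
import numpy as np

def select(keys, queries, predicate):
    # Boolean attention matrix A[q, k] = predicate(keys[k], queries[q]),
    # causally masked so each position attends only to itself and earlier keys.
    n = len(keys)
    A = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):  # causal mask
            A[q, k] = predicate(keys[k], queries[q])
    return A

def aggregate_count(A):
    # Number of selected (attended-to) positions per query position.
    return A.sum(axis=1)

def count_program(tokens):
    # Count-style task: at each position, count how many positions up to and
    # including the current one hold the same token.
    x = np.array(tokens)
    A = select(x, x, lambda k, q: k == q)
    return aggregate_count(A)

print(count_program([3, 1, 3, 3, 2]))  # -> [1 1 2 3 1]
```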
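
The synthetic-data description quoted under Open Datasets can be illustrated with a short sketch. Only the uniform sampling of example lengths from 1 up to the max training length comes from the paper; the copy task, vocabulary size, and maximum length below are assumptions for illustration.

```python
# Hedged sketch of the described data recipe: per-example lengths drawn
# uniformly from 1 to the max training length. Task and vocab are assumed.
import numpy as np

rng = np.random.default_rng(0)

def sample_copy_example(max_train_len=20, vocab_size=16):
    length = rng.integers(1, max_train_len + 1)   # uniform over [1, max_train_len]
    tokens = rng.integers(0, vocab_size, size=length)
    return tokens.tolist(), tokens.tolist()       # (input, target) for a copy task

x, y = sample_copy_example()
```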
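
The Experiment Setup row reports AdamW with a cosine learning rate schedule, per-task weight decay, and gradient clipping. A minimal PyTorch sketch of that optimization loop follows; the model, learning rate, batch size, and step count are placeholders, not the hyperparameters from the paper's Table 1.

```python
# Hedged sketch of the reported setup: AdamW + cosine LR schedule with
# optional gradient clipping. All concrete values are placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(64, 64)  # stand-in for the decoder-only Transformer
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # weight decay 0 or 0.1 per task
scheduler = CosineAnnealingLR(optimizer, T_max=1_000)             # cosine schedule over training steps

max_grad_norm = 1.0  # paper reports grad clip of 0 (off) or 1, depending on task

for step in range(1_000):
    loss = model(torch.randn(8, 64)).pow(2).mean()  # dummy loss in place of next-token loss
    optimizer.zero_grad()
    loss.backward()
    if max_grad_norm > 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
```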