Auto-Regressive Next-Token Predictors are Universal Learners

Authors: Eran Malach

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks.
Researcher Affiliation | Academia | Harvard University, Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Eran Malach <emalach@fas.harvard.edu>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | We train a linear next-token prediction network on the TinyStories dataset (Eldan & Li, 2023), a collection of short stories composed of simple words.
Dataset Splits | Yes | We split all pairs of 4-digit numbers arbitrarily, use 75% for training, and keep the rest for validation.
Hardware Specification | Yes | The model is trained for 5.5 hours on a single A100 machine.
Software Dependencies | No | The paper does not specify software dependencies with version numbers.
Experiment Setup | Yes | We train a linear model with context length of T = 64 on this dataset. The model has only three layers: 1) a standard (linear) embedding layer, mapping tokens into a vector of dimension d = 256; 2) a linear layer mapping d·T to d·T (using standard masking for next-token prediction during training); 3) an output embedding layer mapping vectors of dimension d = 256 back into the output space of all tokens... The model is optimized with the cross-entropy loss, using a softmax operation applied to the outputs.
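
The experiment-setup row above fully specifies the architecture, so the following is a minimal PyTorch sketch of such a linear next-token predictor. It is an illustration only: the vocabulary size, the block-wise causal mask on the d·T × d·T weight matrix, the training hyperparameters, and names such as LinearNextTokenPredictor are assumptions, not details taken from the paper's code.

```python
# Sketch (not the authors' code) of the linear next-token predictor described above:
# token embedding (d = 256), one masked linear layer over the flattened context
# (d*T -> d*T, T = 64), and an output projection back to the vocabulary.
import torch
import torch.nn as nn


class LinearNextTokenPredictor(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, T: int = 64):
        super().__init__()
        self.d, self.T = d, T
        self.embed = nn.Embedding(vocab_size, d)        # 1) linear input embedding
        self.mix = nn.Linear(d * T, d * T, bias=False)  # 2) linear layer over the whole context
        self.unembed = nn.Linear(d, vocab_size)         # 3) output embedding to token logits
        # Block-wise causal mask (an assumption about "standard masking"): the output
        # block for position t may only read embedding blocks of positions <= t, so
        # every position is a valid next-token prediction target during training.
        mask = torch.tril(torch.ones(T, T))
        self.register_buffer("mask", mask.repeat_interleave(d, 0).repeat_interleave(d, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) integer token ids
        B = tokens.shape[0]
        x = self.embed(tokens).reshape(B, self.T * self.d)        # flatten the context
        x = nn.functional.linear(x, self.mix.weight * self.mask)  # masked d*T -> d*T map
        return self.unembed(x.reshape(B, self.T, self.d))         # (batch, T, vocab) logits


# Training as stated in the row above (vocabulary size and batch size assumed):
# cross-entropy on the softmax of the logits, targets being the inputs shifted by one.
model = LinearNextTokenPredictor(vocab_size=512)
tokens = torch.randint(0, 512, (8, 64))
logits = model(tokens)                                            # (8, 64, 512)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 512), tokens[:, 1:].reshape(-1)
)
```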