Auto-Regressive Next-Token Predictors are Universal Learners
Authors: Eran Malach
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the autoregressive next-token training scheme, and not necessarily to a particular choice of architecture. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. |
| Researcher Affiliation | Academia | 1Harvard University, Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Eran Malach <emalach@fas.harvard.edu>. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We train a linear next-token prediction network on the Tiny Stories dataset (Eldan & Li, 2023), a collection of short stories composed of simple words. |
| Dataset Splits | Yes | We split all pairs of 4-digit numbers arbitrarily, use 75% for training, and keep the rest for validation. |
| Hardware Specification | Yes | The model is trained for 5½ hours on a single A100 machine. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | We train a linear model with context length of T = 64 on this dataset. The model has only three layers: 1) a standard (linear) embedding layer, mapping tokens into a vector of dimension d = 256; 2) a linear layer mapping d·T to d·T (using standard masking for next-token prediction during training); 3) an output embedding layer mapping vectors of dimension d = 256 back into the output space of all tokens... The model is optimized with the cross-entropy loss, using a softmax operation applied to the outputs. |
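
The Dataset Splits row quotes a 75% / 25% split over all pairs of 4-digit numbers. The paper excerpt only states the proportions, not the mechanism, so the sketch below is one reproducible way to realize such a split: a deterministic, hash-based assignment per pair, which avoids materializing all ~81 million pairs. The function name and the use of SHA-256 are assumptions, not taken from the paper.

```python
# Minimal sketch of a reproducible 75% / 25% train/validation split over all
# pairs of 4-digit numbers (Dataset Splits row). Hash-based assignment is an
# assumption; the paper only specifies the proportions.
import hashlib

def is_train_pair(a: int, b: int, train_frac: float = 0.75) -> bool:
    """Deterministically assign the pair (a, b) to train (True) or validation (False)."""
    digest = hashlib.sha256(f"{a},{b}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1)
    return u < train_frac

# Example: route a few pairs of 4-digit numbers to their split.
for a, b in [(1234, 5678), (9999, 1000), (4321, 8765)]:
    print(a, b, "train" if is_train_pair(a, b) else "validation")
```

Because the assignment depends only on the pair itself, the split is stable across runs and machines without storing an explicit index; the fraction of training pairs converges to 75% over the full pair space rather than being exact.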
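
The Experiment Setup row describes a three-layer linear next-token predictor. Below is a minimal PyTorch sketch of that description. Only d = 256, T = 64, the masked d·T-to-d·T linear layer, and the cross-entropy objective come from the quoted text; the vocabulary size, absence of biases in the mixing layer, and the smoke-test values are assumptions for illustration.

```python
# Sketch of the three-layer linear next-token predictor from the Experiment
# Setup row. Paper-specified parts: d = 256, T = 64, masked d*T -> d*T linear
# layer, cross-entropy loss. Everything else is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearNextTokenPredictor(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, T: int = 64):
        super().__init__()
        self.d, self.T = d, T
        self.embed = nn.Embedding(vocab_size, d)         # 1) linear input embedding
        self.mix = nn.Linear(T * d, T * d, bias=False)   # 2) d*T -> d*T linear layer
        self.unembed = nn.Linear(d, vocab_size)          # 3) output embedding
        # Causal mask over the T x T block structure: output position t may
        # only read input positions <= t (standard next-token masking).
        self.register_buffer("causal", torch.tril(torch.ones(T, T)).view(T, 1, T, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) integer token ids; returns (batch, T, vocab) logits.
        B = tokens.shape[0]
        x = self.embed(tokens).reshape(B, self.T * self.d)
        # Zero the weight blocks that would let position t see positions > t.
        W = self.mix.weight.view(self.T, self.d, self.T, self.d) * self.causal
        x = F.linear(x, W.reshape(self.T * self.d, self.T * self.d))
        return self.unembed(x.reshape(B, self.T, self.d))

# Smoke test with small placeholder sizes; the paper-scale model uses the
# defaults d = 256, T = 64 and the Tiny Stories tokenizer vocabulary.
model = LinearNextTokenPredictor(vocab_size=128, d=16, T=8)
tokens = torch.randint(0, 128, (4, 8))
logits = model(tokens)
# Position t predicts token t + 1, so shift the targets by one.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 128), tokens[:, 1:].reshape(-1))
loss.backward()
```

The masking is applied to the weight matrix rather than the activations, so the layer remains purely linear while still respecting the autoregressive dependency structure during training.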