Auto-Regressive Next-Token Predictors are Universal Learners
Authors: Eran Malach
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the autoregressive next-token training scheme, and not necessarily to a particular choice of architecture. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. |
| Researcher Affiliation | Academia | 1Harvard University, Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Eran Malach <emalach@fas.harvard.edu>. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We train a linear next-token prediction network on the Tiny Stories dataset (Eldan & Li, 2023), a collection of short stories composed of simple words. |
| Dataset Splits | Yes | We split all pairs of 4-digit numbers arbitrarily, use 75% for training, and keep the rest for validation. |
| Hardware Specification | Yes | The model is trained for 5½ hours on a single A100 machine. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | We train a linear model with context length of T = 64 on this dataset. The model has only three layers: 1) a standard (linear) embedding layer, mapping tokens into a vector of dimension d = 256; 2) a linear layer mapping d·T to d·T (using standard masking for next-token prediction during training); 3) an output embedding layer mapping vectors of dimension d = 256 back into the output space of all tokens... The model is optimized with the cross-entropy loss, using a softmax operation applied to the outputs. |
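
The Dataset Splits row quotes a 75% / 25% split over all pairs of 4-digit numbers. The paper excerpt only states the proportions, not the mechanism, so the sketch below is one reproducible way to realize such a split: a deterministic, hash-based assignment per pair, which avoids materializing all ~81 million pairs. The function name and the use of SHA-256 are assumptions, not taken from the paper.

```python
# Minimal sketch of a reproducible 75% / 25% train/validation split over all
# pairs of 4-digit numbers (Dataset Splits row). Hash-based assignment is an
# assumption; the paper only specifies the proportions.
import hashlib

def is_train_pair(a: int, b: int, train_frac: float = 0.75) -> bool:
    """Deterministically assign the pair (a, b) to train (True) or validation (False)."""
    digest = hashlib.sha256(f"{a},{b}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1)
    return u < train_frac

# Example: route a few pairs of 4-digit numbers to their split.
for a, b in [(1234, 5678), (9999, 1000), (4321, 8765)]:
    print(a, b, "train" if is_train_pair(a, b) else "validation")
```

Because the assignment depends only on the pair itself, the split is stable across runs and machines without storing an explicit index; the fraction of training pairs converges to 75% over the full pair space rather than being exact.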
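
The Experiment Setup row describes a three-layer linear next-token predictor. Below is a minimal PyTorch sketch of that description. Only d = 256, T = 64, the masked d·T-to-d·T linear layer, and the cross-entropy objective come from the quoted text; the vocabulary size, absence of biases in the mixing layer, and the smoke-test values are assumptions for illustration.

```python
# Sketch of the three-layer linear next-token predictor from the Experiment
# Setup row. Paper-specified parts: d = 256, T = 64, masked d*T -> d*T linear
# layer, cross-entropy loss. Everything else is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearNextTokenPredictor(nn.Module):
    def __init__(self, vocab_size: int, d: int = 256, T: int = 64):
        super().__init__()
        self.d, self.T = d, T
        self.embed = nn.Embedding(vocab_size, d)         # 1) linear input embedding
        self.mix = nn.Linear(T * d, T * d, bias=False)   # 2) d*T -> d*T linear layer
        self.unembed = nn.Linear(d, vocab_size)          # 3) output embedding
        # Causal mask over the T x T block structure: output position t may
        # only read input positions <= t (standard next-token masking).
        self.register_buffer("causal", torch.tril(torch.ones(T, T)).view(T, 1, T, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) integer token ids; returns (batch, T, vocab) logits.
        B = tokens.shape[0]
        x = self.embed(tokens).reshape(B, self.T * self.d)
        # Zero the weight blocks that would let position t see positions > t.
        W = self.mix.weight.view(self.T, self.d, self.T, self.d) * self.causal
        x = F.linear(x, W.reshape(self.T * self.d, self.T * self.d))
        return self.unembed(x.reshape(B, self.T, self.d))

# Smoke test with small placeholder sizes; the paper-scale model uses the
# defaults d = 256, T = 64 and the Tiny Stories tokenizer vocabulary.
model = LinearNextTokenPredictor(vocab_size=128, d=16, T=8)
tokens = torch.randint(0, 128, (4, 8))
logits = model(tokens)
# Position t predicts token t + 1, so shift the targets by one.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 128), tokens[:, 1:].reshape(-1))
loss.backward()
```

The masking is applied to the weight matrix rather than the activations, so the layer remains purely linear while still respecting the autoregressive dependency structure during training.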