Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Auto-Regressive Next-Token Predictors are Universal Learners
Authors: Eran Malach
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the autoregressive next-token training scheme, and not necessarily to a particular choice of architecture. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. |
| Researcher Affiliation | Academia | 1Harvard University, Kempner Institute for the Study of Natural and Artificial Intelligence. Correspondence to: Eran Malach <EMAIL>. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | We train a linear next-token prediction network on the Tiny Stories dataset (Eldan & Li, 2023), a collection of short stories composed of simple words. |
| Dataset Splits | Yes | We split all pairs of 4-digit numbers arbitrarily, use 75% for training, and keep the rest for validation. |
| Hardware Specification | Yes | The model is trained for 5 1/2 hours on a single A100 machine. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | We train a linear model with context length of T = 64 on this dataset. The model has only three layers: 1) a standard (linear) embedding layer, mapping tokens into a vector of dimension d = 256; 2) a linear layer mapping d × T to d × T (using standard masking for next-token prediction during training); 3) an output embedding layer mapping vectors of dimension d = 256 back into the output space of all tokens... The model is optimized with the cross-entropy loss, using a softmax operation applied to the outputs. |
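
The three-layer linear model quoted in the Experiment Setup row can be sketched in NumPy as follows. This is a minimal illustration and not the author's code, which the paper does not release. The class and function names, the initialization scale, the absence of bias terms, and the reading of "standard masking" as a block lower-triangular mask over token positions are all assumptions.

```python
import numpy as np


def causal_block_mask(T, d):
    """(T*d, T*d) mask: output position t may read input positions s <= t.

    Assumption: "standard masking" is taken to mean a lower-triangular mask
    over token positions, expanded to d x d blocks.
    """
    token_mask = np.tril(np.ones((T, T)))          # [t, s] = 1 iff s <= t
    return np.kron(token_mask, np.ones((d, d)))    # expand each entry to a d x d block


class LinearNextTokenModel:
    """Hypothetical sketch: embedding -> masked (d*T -> d*T) linear map -> output embedding."""

    def __init__(self, vocab_size, T=64, d=256, seed=0):
        # Note: with T=64 and d=256 the middle matrix has (T*d)^2 entries,
        # so use much smaller T and d when experimenting on a laptop.
        rng = np.random.default_rng(seed)
        self.T, self.d = T, d
        self.embed = rng.normal(0.0, 0.02, (vocab_size, d))    # layer 1: token -> R^d
        self.mix = rng.normal(0.0, 0.02, (T * d, T * d))       # layer 2: R^{dT} -> R^{dT}
        self.unembed = rng.normal(0.0, 0.02, (d, vocab_size))  # layer 3: R^d -> token logits
        self.mask = causal_block_mask(T, d)

    def logits(self, tokens):
        """tokens: int array of shape (T,). Returns logits of shape (T, vocab_size)."""
        x = self.embed[tokens].reshape(-1)         # flatten (T, d) to (T*d,)
        mixed = (self.mix * self.mask) @ x         # masked linear map, no nonlinearity
        return mixed.reshape(self.T, self.d) @ self.unembed


def cross_entropy(logits, targets):
    """Mean cross-entropy of softmax(logits) against integer targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With the mask applied, the logits at position t depend only on tokens at positions up to t, so a single forward pass yields a next-token prediction for every position in the context.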