The Pitfalls of Next-Token Prediction

Authors: Gregor Bachmann, Vaishnavh Nagarajan

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the above mechanism leads to complete in-distribution failure in a path-finding setup on a graph, which we propose as a minimal lookahead task. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. |
| Researcher Affiliation | Collaboration | ¹ ETH Zürich, Switzerland; ² Google Research, US. |
| Pseudocode | No | The paper describes methods using text and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available under https://github.com/gregorbachmann/Next-Token-Failures |
| Open Datasets | No | The paper describes its custom dataset generation process: 'Dataset. We denote by G_{d,l}(N) for d, l, N ∈ ℕ a path-star graph consisting of a center node v_start with degree d ∈ ℕ, meaning there are d different paths emerging from the center node, each consisting of l - 1 nodes (excluding the start node).' However, it does not provide a direct link, DOI, or specific citation for public access to the generated dataset. (A hedged construction sketch appears below the table.) |
| Dataset Splits | No | The paper mentions generating 'training and test graphs' and fixing 'the number of samples to 200k', but it does not specify any explicit validation dataset splits, proportions, or procedures for model selection. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific GPU models, CPU types, or memory configurations. It mentions having 'trained the cheaper models for as long as 500 epochs', but gives no details on the hardware itself. |
| Software Dependencies | No | The paper states 'We optimize using AdamW (Loshchilov & Hutter, 2019)' but does not list any specific software libraries (such as PyTorch or TensorFlow) with the version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | When training Transformer models from scratch, we use a small model consisting of n_layers = 12 blocks with embedding dimension e_dim = 384, n_heads = 6 attention heads and MLP expansion factor e = 4, coined GPT-Mini. We train all the models with the AdamW optimizer (Loshchilov & Hutter, 2019). For models trained from scratch we use a learning rate of η = 0.0005, while for pre-trained models we use a smaller one of η = 0.0001. In both cases we use weight decay of strength 0.01. (A configuration sketch appears below the table.) |
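
The path-star dataset quoted under Open Datasets can be regenerated from its definition alone. Below is a minimal sketch, assuming that node labels are drawn without replacement from an alphabet of size N and that samples are serialized as a shuffled edge list followed by a start/goal query; both assumptions go beyond the quoted excerpt, and the authors' generator in the linked repository may differ.

```python
import random

def sample_path_star_graph(d: int, l: int, N: int, rng: random.Random):
    """Sample one path-star graph G_{d,l}(N) as described in the excerpt:
    a start node of degree d with d disjoint paths, each contributing
    l - 1 further nodes. Treating N as the size of the node-label
    alphabet is an assumption not stated in the excerpt."""
    num_nodes = 1 + d * (l - 1)
    labels = rng.sample(range(N), num_nodes)
    start, rest = labels[0], labels[1:]

    edges, paths = [], []
    for i in range(d):
        path = [start] + rest[i * (l - 1):(i + 1) * (l - 1)]
        edges += list(zip(path[:-1], path[1:]))
        paths.append(path)
    return start, edges, paths

# Example usage: build one graph and one (prompt, answer) sample.
rng = random.Random(0)
start, edges, paths = sample_path_star_graph(d=5, l=5, N=100, rng=rng)
target_path = rng.choice(paths)        # the path the model must output
goal = target_path[-1]                 # its terminal (leaf) node
rng.shuffle(edges)                     # hide the path structure in the prompt
# The serialization format below is purely illustrative.
prompt = " | ".join(f"{a},{b}" for a, b in edges) + f" / {start},{goal} = "
answer = " ".join(str(v) for v in target_path)
print(prompt + answer)
```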
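
The quoted Experiment Setup parameters map directly onto a small transformer configuration. The following is a hedged PyTorch sketch that uses a generic transformer stack as a stand-in for the authors' GPT-Mini; the vocabulary size, positional handling, masking, and norm placement are assumptions not stated in the excerpt, and the actual implementation lives in the linked repository.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class GPTMiniConfig:
    # Architecture reported for the from-scratch "GPT-Mini" model.
    n_layers: int = 12        # transformer blocks
    e_dim: int = 384          # embedding dimension
    n_heads: int = 6          # attention heads
    mlp_expansion: int = 4    # MLP hidden size = 4 * e_dim
    vocab_size: int = 1024    # assumption: not stated in the excerpt

def build_model(cfg: GPTMiniConfig) -> nn.Module:
    # A stand-in stack with the reported shape; it is not the authors'
    # GPT-Mini class and omits details such as causal masking and
    # positional embeddings.
    block = nn.TransformerEncoderLayer(
        d_model=cfg.e_dim,
        nhead=cfg.n_heads,
        dim_feedforward=cfg.mlp_expansion * cfg.e_dim,
        batch_first=True,
    )
    return nn.Sequential(
        nn.Embedding(cfg.vocab_size, cfg.e_dim),
        nn.TransformerEncoder(block, num_layers=cfg.n_layers),
        nn.Linear(cfg.e_dim, cfg.vocab_size),
    )

cfg = GPTMiniConfig()
model = build_model(cfg)

# Optimizer hyperparameters quoted above: AdamW with weight decay 0.01,
# lr = 5e-4 when training from scratch and 1e-4 when fine-tuning a
# pre-trained model.
from_scratch = True
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4 if from_scratch else 1e-4,
    weight_decay=0.01,
)
```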