The Pitfalls of Next-Token Prediction

Authors: Gregor Bachmann, Vaishnavh Nagarajan

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the above mechanism leads to complete in-distribution failure in a path-finding setup on a graph, which we propose as a minimal lookahead task. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. |
| Researcher Affiliation | Collaboration | ¹ ETH Zürich, Switzerland; ² Google Research, US. |
| Pseudocode | No | The paper describes methods using text and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available under https://github.com/gregorbachmann/Next-Token-Failures |
| Open Datasets | No | The paper describes its custom dataset generation process: 'Dataset. We denote by G_{d,l}(N) for d, l, N ∈ ℕ a path-star graph consisting of a center node v_start with degree d ∈ ℕ, meaning there are d different paths emerging from the center node, each consisting of l - 1 nodes (excluding the start node).' However, it does not provide a direct link, DOI, or specific citation for public access to the generated dataset. (A hedged construction sketch appears below the table.) |
| Dataset Splits | No | The paper mentions generating 'training and test graphs' and fixing 'the number of samples to 200k', but it does not specify any explicit validation dataset splits, proportions, or procedures for model selection. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific GPU models, CPU types, or memory configurations. It mentions having 'trained the cheaper models for as long as 500 epochs', but gives no details on the hardware itself. |
| Software Dependencies | No | The paper states 'We optimize using AdamW (Loshchilov & Hutter, 2019)' but does not list any specific software libraries (such as PyTorch or TensorFlow) with the version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | When training Transformer models from scratch, we use a small model consisting of n_layers = 12 blocks with embedding dimension e_dim = 384, n_heads = 6 attention heads and MLP expansion factor e = 4, coined GPT-Mini. We train all the models with the AdamW optimizer (Loshchilov & Hutter, 2019). For models trained from scratch we use a learning rate of η = 0.0005, while for pre-trained models we use a smaller one of η = 0.0001. In both cases we use weight decay of strength 0.01. (A configuration sketch appears below the table.) |
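
The path-star dataset quoted under Open Datasets can be regenerated from its definition alone. Below is a minimal sketch, assuming that node labels are drawn without replacement from an alphabet of size N and that samples are serialized as a shuffled edge list followed by a start/goal query; both assumptions go beyond the quoted excerpt, and the authors' generator in the linked repository may differ.

```python
import random

def sample_path_star_graph(d: int, l: int, N: int, rng: random.Random):
    """Sample one path-star graph G_{d,l}(N) as described in the excerpt:
    a start node of degree d with d disjoint paths, each contributing
    l - 1 further nodes. Treating N as the size of the node-label
    alphabet is an assumption not stated in the excerpt."""
    num_nodes = 1 + d * (l - 1)
    labels = rng.sample(range(N), num_nodes)
    start, rest = labels[0], labels[1:]

    edges, paths = [], []
    for i in range(d):
        path = [start] + rest[i * (l - 1):(i + 1) * (l - 1)]
        edges += list(zip(path[:-1], path[1:]))
        paths.append(path)
    return start, edges, paths

# Example usage: build one graph and one (prompt, answer) sample.
rng = random.Random(0)
start, edges, paths = sample_path_star_graph(d=5, l=5, N=100, rng=rng)
target_path = rng.choice(paths)        # the path the model must output
goal = target_path[-1]                 # its terminal (leaf) node
rng.shuffle(edges)                     # hide the path structure in the prompt
# The serialization format below is purely illustrative.
prompt = " | ".join(f"{a},{b}" for a, b in edges) + f" / {start},{goal} = "
answer = " ".join(str(v) for v in target_path)
print(prompt + answer)
```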
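
The quoted Experiment Setup parameters map directly onto a small transformer configuration. The following is a hedged PyTorch sketch that uses a generic transformer stack as a stand-in for the authors' GPT-Mini; the vocabulary size, positional handling, masking, and norm placement are assumptions not stated in the excerpt, and the actual implementation lives in the linked repository.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class GPTMiniConfig:
    # Architecture reported for the from-scratch "GPT-Mini" model.
    n_layers: int = 12        # transformer blocks
    e_dim: int = 384          # embedding dimension
    n_heads: int = 6          # attention heads
    mlp_expansion: int = 4    # MLP hidden size = 4 * e_dim
    vocab_size: int = 1024    # assumption: not stated in the excerpt

def build_model(cfg: GPTMiniConfig) -> nn.Module:
    # A stand-in stack with the reported shape; it is not the authors'
    # GPT-Mini class and omits details such as causal masking and
    # positional embeddings.
    block = nn.TransformerEncoderLayer(
        d_model=cfg.e_dim,
        nhead=cfg.n_heads,
        dim_feedforward=cfg.mlp_expansion * cfg.e_dim,
        batch_first=True,
    )
    return nn.Sequential(
        nn.Embedding(cfg.vocab_size, cfg.e_dim),
        nn.TransformerEncoder(block, num_layers=cfg.n_layers),
        nn.Linear(cfg.e_dim, cfg.vocab_size),
    )

cfg = GPTMiniConfig()
model = build_model(cfg)

# Optimizer hyperparameters quoted above: AdamW with weight decay 0.01,
# lr = 5e-4 when training from scratch and 1e-4 when fine-tuning a
# pre-trained model.
from_scratch = True
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4 if from_scratch else 1e-4,
    weight_decay=0.01,
)
```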