The Pitfalls of Next-Token Prediction
Authors: Gregor Bachmann, Vaishnavh Nagarajan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the above mechanism leads to complete in-distribution failure in a path-finding setup on a graph that we propose as a minimal lookahead task. We provide preliminary evidence that this failure can be resolved when training to predict multiple tokens in advance. |
| Researcher Affiliation | Collaboration | 1ETH Zürich, Switzerland 2Google Research, US. |
| Pseudocode | No | The paper describes methods using text and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available under https://github.com/gregorbachmann/Next-Token-Failures |
| Open Datasets | No | The paper describes its custom dataset generation process: 'Dataset. We denote by G_{d,l}(N) for d, l, N ∈ ℕ, a path-star graph consisting of a center node v_start with degree d ∈ ℕ, meaning there are d different paths emerging from the center node, each consisting of l − 1 nodes (excluding the start node).' However, it does not provide a direct link, DOI, or specific citation for public access to the generated dataset. (A minimal generation sketch for this graph family is given after the table.) |
| Dataset Splits | No | The paper mentions generating 'training and test graphs' and fixing 'the number of samples to 200k' but does not specify any explicit validation dataset splits, proportions, or procedures for model selection. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as specific GPU models, CPU types, or memory configurations. It mentions having 'trained the cheaper models for as long as 500 epochs' but gives no details on the hardware itself. |
| Software Dependencies | No | The paper states 'We optimize using AdamW (Loshchilov & Hutter, 2019)' but does not list any specific software libraries (like PyTorch, TensorFlow) with their version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | When training Transformer models from scratch, we use a small model consisting of n_layers = 12 blocks with embedding dimension e_dim = 384, n_heads = 6 attention heads and MLP expansion factor e = 4, coined GPT-Mini. We train all the models with the AdamW optimizer (Loshchilov & Hutter, 2019). For models trained from scratch we use a learning rate of η = 0.0005 while for pre-trained models we use a smaller one of η = 0.0001. In both cases we use weight decay of strength 0.01. (A configuration sketch with these hyperparameters is given after the table.) |
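
To make the path-star dataset quoted under 'Open Datasets' concrete, here is a minimal Python sketch of generating one graph G_{d,l}(N). This is not the authors' released code: the labelling scheme (node names sampled from {0, …, N−1}) and the helper name `make_path_star_graph` are assumptions; only the structure (a center node v_start with d paths of l − 1 further nodes each) comes from the quoted description.

```python
import random

def make_path_star_graph(d, l, N, rng=random):
    """Sketch of a path-star graph G_{d,l}(N): a center node v_start with d
    emerging paths, each containing l - 1 further nodes. Node labels are
    drawn without replacement from {0, ..., N-1} (an assumption)."""
    assert d * (l - 1) + 1 <= N, "need enough distinct node labels"
    labels = rng.sample(range(N), d * (l - 1) + 1)
    v_start, rest = labels[0], labels[1:]
    # Each path starts at the center node and continues with l - 1 nodes.
    paths = [[v_start] + rest[i * (l - 1):(i + 1) * (l - 1)] for i in range(d)]
    edges = [(p[j], p[j + 1]) for p in paths for j in range(len(p) - 1)]
    return v_start, paths, edges

# Example: d = 2 paths of length l = 5, node labels from a pool of N = 50.
v_start, paths, edges = make_path_star_graph(d=2, l=5, N=50)
```

How the sampled graphs are serialized into prefix/target token sequences for training is not specified in the quoted excerpt, so the sketch stops at the graph itself.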
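
The 'Experiment Setup' row can likewise be written out as a configuration sketch. The code below assumes PyTorch and uses a generic Transformer encoder stack as a stand-in for the authors' GPT-Mini model, whose exact implementation is not given in this excerpt; only the hyperparameters (12 blocks, embedding dimension 384, 6 heads, MLP expansion factor 4, AdamW with weight decay 0.01 and learning rate 0.0005 from scratch / 0.0001 pre-trained) are taken from the table.

```python
import torch
from torch import nn

# Hyperparameters quoted for GPT-Mini in the experiment setup.
cfg = dict(n_layers=12, e_dim=384, n_heads=6, mlp_expansion=4)

# Stand-in backbone: a stack of standard Transformer blocks with the quoted
# dimensions. The authors' actual (causal, GPT-style) model class is not
# specified in this excerpt.
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=cfg["e_dim"],
        nhead=cfg["n_heads"],
        dim_feedforward=cfg["mlp_expansion"] * cfg["e_dim"],
        batch_first=True,
    ),
    num_layers=cfg["n_layers"],
)

# Optimizer settings from the table: AdamW with weight decay 0.01 and
# lr = 0.0005 when training from scratch (0.0001 for pre-trained models).
optimizer = torch.optim.AdamW(blocks.parameters(), lr=0.0005, weight_decay=0.01)
```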