Arrows of Time for Large Language Models
Authors: Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically find a time asymmetry in [LLMs'] ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). (A measurement sketch is given below the table.) |
| Researcher Affiliation | Academia | FSL/Institute of Physics, EPFL; CSFT/Institute of Mathematics, EPFL, Lausanne, Switzerland; Department of Computing, Goldsmiths, University of London, London, UK. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | our implementation (with code in the supplementary material) is derived from minGPT (Karpathy, 2023). |
| Open Datasets | Yes | We conduct our natural language experiments on the CC100 dataset (Wenzek et al., 2019; Conneau et al., 2020), which provides large monolingual text datasets for a variety of languages and is reasonably homogeneous across languages. |
| Dataset Splits | Yes | We withhold 250k sentences from the dataset for validation. |
| Hardware Specification | Yes | All experiments (save for the 512 context size) were run on a single A100 GPU |
| Software Dependencies | No | The paper mentions 'minGPT (Karpathy, 2023)', the 'AdamW optimizer (Loshchilov & Hutter, 2019)', and the 'torch.nn' module of the PyTorch Python library, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For all models, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a base learning rate of 10^-4 and a learning rate schedule with a warmup, followed by cosine annealing with warm restarts (Loshchilov & Hutter, 2017). These hyperparameters are mostly kept constant across different experiments, although the period of the warm restarts might be tweaked to synchronize the end of training with the end of a cycle; see Appendix A for details. (A minimal optimizer/schedule sketch is given below the table.) |
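
The time asymmetry quoted in the Research Type row is the gap between the average next-token and previous-token cross-entropies (log-perplexities). The sketch below is not the authors' code; it assumes two hypothetical causal language models, `fw_model` trained on the original token order and `bw_model` trained on reversed sequences, each mapping a batch of token ids to logits.

```python
# Hedged sketch (not the authors' code): measuring the forward/backward
# log-perplexity gap. `fw_model` and `bw_model` are hypothetical causal LMs
# assumed to map a LongTensor of token ids (batch, seq_len) to logits of
# shape (batch, seq_len, vocab_size); the second is trained on reversed text.
import torch
import torch.nn.functional as F


@torch.no_grad()
def avg_log_perplexity(model, token_batches):
    """Mean cross-entropy in nats per predicted token over the batches."""
    total_loss, total_tokens = 0.0, 0
    for tokens in token_batches:
        logits = model(tokens[:, :-1])   # predict token t+1 from the prefix up to t
        targets = tokens[:, 1:]
        total_loss += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="sum",
        ).item()
        total_tokens += targets.numel()
    return total_loss / total_tokens


def arrow_of_time_gap(fw_model, bw_model, token_batches):
    """Backward-minus-forward log-perplexity; a positive value means the
    forward (next-token) direction is easier to model."""
    fw = avg_log_perplexity(fw_model, token_batches)
    bw = avg_log_perplexity(bw_model, [t.flip(dims=[1]) for t in token_batches])
    return bw - fw
```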
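
The Experiment Setup row describes AdamW with a base learning rate of 10^-4 and a warmup followed by cosine annealing with warm restarts. A minimal PyTorch sketch of one way to compose such a schedule is shown below; the warmup length and restart period are illustrative placeholders, not values from the paper.

```python
# Hedged sketch of the reported setup: AdamW at a base learning rate of 1e-4,
# a linear warmup, then cosine annealing with warm restarts. `warmup_steps`
# and `restart_period` are illustrative placeholders, not values from the paper.
import torch


def build_optimizer_and_scheduler(model, base_lr=1e-4,
                                  warmup_steps=1_000, restart_period=10_000):
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.01, total_iters=warmup_steps
    )
    cosine = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=restart_period
    )
    scheduler = torch.optim.lr_scheduler.SequentialLR(
        optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps]
    )
    return optimizer, scheduler
```

In this sketch `scheduler.step()` would be called once per optimization step, so the warm-restart period is counted in steps rather than epochs.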