Arrows of Time for Large Language Models

Authors: Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...)." (A sketch of this forward/backward comparison is given below the table.)
Researcher Affiliation | Academia | 1. FSL/Institute of Physics, EPFL; 2. CSFT/Institute of Mathematics, EPFL, Lausanne, Switzerland; 3. Department of Computing, Goldsmiths/University of London, London, UK.
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | "our implementation (with code in the supplementary material) is derived from minGPT (Karpathy, 2023)."
Open Datasets | Yes | "We conduct our natural language experiments on the CC100 dataset (Wenzek et al., 2019; Conneau et al., 2020), which provides large monolingual text datasets for a variety of languages and is reasonably homogeneous across languages."
Dataset Splits | Yes | "We withhold 250k sentences from the dataset for validation." (A split sketch is given below the table.)
Hardware Specification | Yes | "All experiments (save for the 512 context size) were run on a single A100 GPU."
Software Dependencies | No | The paper mentions 'minGPT (Karpathy, 2023)', the 'AdamW optimizer (Loshchilov & Hutter, 2019)', and the 'pytorch.nn module of the PyTorch Python library', but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "For all models, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a base learning rate of 10^-4 and a learning rate schedule with a warmup, followed by cosine annealing with warm restarts (Loshchilov & Hutter, 2017). These hyperparameters are mostly kept constant across different experiments, although the period of the warm restarts might be tweaked to synchronize the end of training with the end of a cycle; see Appendix A for details." (A configuration sketch is given below the table.)
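
The arrow-of-time effect quoted in the Research Type row is a gap between the average log-perplexity of a next-token (forward) model and a previous-token (backward) model. The sketch below shows one way such a gap could be measured; it is not the authors' code. The names `fw_model` and `bw_model` stand for hypothetical pre-trained causal language models sharing one tokenizer (the backward model trained on reversed token sequences), each assumed to return logits of shape (batch, time, vocab_size).

```python
# Minimal sketch (an assumption, not the released implementation) of measuring
# the forward/backward log-perplexity gap on a held-out set of token batches.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_log_perplexity(model, token_batches, reverse=False, device="cuda"):
    """Average per-token cross-entropy (log-perplexity, in nats) over batches."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for tokens in token_batches:                 # tokens: LongTensor of shape (B, T)
        if reverse:                              # backward model reads reversed sequences
            tokens = tokens.flip(dims=[1])
        tokens = tokens.to(device)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)                   # assumed shape: (B, T-1, vocab_size)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return total_loss / total_tokens

# The "arrow of time" signal is the (small but consistent) difference:
# gap = mean_log_perplexity(bw_model, val_batches, reverse=True) \
#       - mean_log_perplexity(fw_model, val_batches, reverse=False)
```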
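
The Dataset Splits row reports that 250k sentences are withheld for validation. The snippet below is a hypothetical way to produce such a split from a monolingual CC100 shard stored as a newline-delimited text file; the file name `en.txt`, the shuffling, and the seed are illustrative assumptions rather than details from the paper.

```python
# Hypothetical split: withhold 250k sentences for validation, keep the rest for training.
import random

def split_cc100_shard(path, n_val=250_000, seed=0):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(sentences)       # shuffling and seed are assumptions
    return sentences[n_val:], sentences[:n_val]  # (train, validation)

# train_sents, val_sents = split_cc100_shard("en.txt")
```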
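
The Experiment Setup row describes AdamW with a base learning rate of 10^-4 and a warmup followed by cosine annealing with warm restarts. The sketch below assembles such a schedule from standard PyTorch components; the warmup length, restart period, and weight decay are placeholder values, not the paper's hyperparameters (those are detailed in its Appendix A).

```python
# Sketch of the optimizer and schedule described above, with placeholder hyperparameters.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingWarmRestarts, SequentialLR

def build_optimizer(model, base_lr=1e-4, warmup_steps=1_000, restart_period=10_000):
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.1)
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)  # linear warmup
    cosine = CosineAnnealingWarmRestarts(optimizer, T_0=restart_period)        # warm restarts
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler

# Usage (one scheduler.step() per optimizer step):
#   optimizer, scheduler = build_optimizer(model)
#   for batch in loader:
#       loss = compute_loss(model, batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```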