Arrows of Time for Large Language Models

Authors: Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...)." (A sketch of this forward/backward comparison is given below the table.)
Researcher Affiliation | Academia | 1. FSL/Institute of Physics, EPFL; 2. CSFT/Institute of Mathematics, EPFL, Lausanne, Switzerland; 3. Department of Computing, Goldsmiths/University of London, London, UK.
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | "our implementation (with code in the supplementary material) is derived from minGPT (Karpathy, 2023)."
Open Datasets | Yes | "We conduct our natural language experiments on the CC100 dataset (Wenzek et al., 2019; Conneau et al., 2020), which provides large monolingual text datasets for a variety of languages and is reasonably homogeneous across languages."
Dataset Splits | Yes | "We withhold 250k sentences from the dataset for validation." (A split sketch is given below the table.)
Hardware Specification | Yes | "All experiments (save for the 512 context size) were run on a single A100 GPU."
Software Dependencies | No | The paper mentions 'minGPT (Karpathy, 2023)', the 'AdamW optimizer (Loshchilov & Hutter, 2019)', and the 'pytorch.nn module of the PyTorch Python library', but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "For all models, we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a base learning rate of 10^-4 and a learning rate schedule with a warmup, followed by cosine annealing with warm restarts (Loshchilov & Hutter, 2017). These hyperparameters are mostly kept constant across different experiments, although the period of the warm restarts might be tweaked to synchronize the end of training with the end of a cycle; see Appendix A for details." (A configuration sketch is given below the table.)
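
The arrow-of-time effect quoted in the Research Type row is a gap between the average log-perplexity of a next-token (forward) model and a previous-token (backward) model. The sketch below shows one way such a gap could be measured; it is not the authors' code. The names `fw_model` and `bw_model` stand for hypothetical pre-trained causal language models sharing one tokenizer (the backward model trained on reversed token sequences), each assumed to return logits of shape (batch, time, vocab_size).

```python
# Minimal sketch (an assumption, not the released implementation) of measuring
# the forward/backward log-perplexity gap on a held-out set of token batches.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_log_perplexity(model, token_batches, reverse=False, device="cuda"):
    """Average per-token cross-entropy (log-perplexity, in nats) over batches."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for tokens in token_batches:                 # tokens: LongTensor of shape (B, T)
        if reverse:                              # backward model reads reversed sequences
            tokens = tokens.flip(dims=[1])
        tokens = tokens.to(device)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)                   # assumed shape: (B, T-1, vocab_size)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return total_loss / total_tokens

# The "arrow of time" signal is the (small but consistent) difference:
# gap = mean_log_perplexity(bw_model, val_batches, reverse=True) \
#       - mean_log_perplexity(fw_model, val_batches, reverse=False)
```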
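
The Dataset Splits row reports that 250k sentences are withheld for validation. The snippet below is a hypothetical way to produce such a split from a monolingual CC100 shard stored as a newline-delimited text file; the file name `en.txt`, the shuffling, and the seed are illustrative assumptions rather than details from the paper.

```python
# Hypothetical split: withhold 250k sentences for validation, keep the rest for training.
import random

def split_cc100_shard(path, n_val=250_000, seed=0):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(sentences)       # shuffling and seed are assumptions
    return sentences[n_val:], sentences[:n_val]  # (train, validation)

# train_sents, val_sents = split_cc100_shard("en.txt")
```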
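
The Experiment Setup row describes AdamW with a base learning rate of 10^-4 and a warmup followed by cosine annealing with warm restarts. The sketch below assembles such a schedule from standard PyTorch components; the warmup length, restart period, and weight decay are placeholder values, not the paper's hyperparameters (those are detailed in its Appendix A).

```python
# Sketch of the optimizer and schedule described above, with placeholder hyperparameters.
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingWarmRestarts, SequentialLR

def build_optimizer(model, base_lr=1e-4, warmup_steps=1_000, restart_period=10_000):
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.1)
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)  # linear warmup
    cosine = CosineAnnealingWarmRestarts(optimizer, T_0=restart_period)        # warm restarts
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_steps])
    return optimizer, scheduler

# Usage (one scheduler.step() per optimizer step):
#   optimizer, scheduler = build_optimizer(model)
#   for batch in loader:
#       loss = compute_loss(model, batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```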