Algorithmic progress in language models

Authors: Wing Hin Anson Ho, Tamay Besiroglu, Ege Erdil, Zifan Guo, David Owen, Robi Rahman, David Atkinson, Neil Thompson, Jaime Sevilla

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 90% confidence interval of around 2 to 22 months, substantially faster than hardware gains per Moore's Law.
Researcher Affiliation | Collaboration | Anson Ho (1), Tamay Besiroglu (1,2), Ege Erdil (1), David Owen (1), Robi Rahman (1), Zifan Carl Guo (2), David Atkinson (1,3), Neil Thompson (2), Jaime Sevilla (1). (1) Epoch, (2) MIT Future Tech, CSAIL, (3) Northeastern University.
Pseudocode | No | The paper describes mathematical models and estimation approaches but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | You can find our code and data here: https://github.com/epoch-research/lm-algorithmic-progress
Open Datasets | Yes | Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 90% confidence interval of around 2 to 22 months, substantially faster than hardware gains per Moore's Law.
Dataset Splits | Yes | We perform extensive cross-validation exercises to identify the variant of the model that fits the data best. We evaluate around 90 different model specifications through leave-one-out cross-validation and pick the models that perform best on relevant out-of-sample metrics; see Appendix J for more details.
Hardware Specification | No | The paper states 'Our empirical results are obtained by fitting a dataset in Google Colab notebooks.', which identifies a computing environment but gives no specific hardware details such as GPU/CPU models or memory.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers used for its analysis, beyond general mentions like 'Google Colab notebooks'.
Experiment Setup | Yes | The core model that we use was chosen based on leave-one-out cross validation, and is defined similarly to equation 3 but with a few modifications. The most important change is that A and B are estimated separately for each benchmark, whereas all other parameters are benchmark-agnostic. In order to help with model fitting, we normalize N and D to some minimum N0 and D0 values in our dataset, and reparameterize A and B as exponentials. In full, our model is L = exp[α_const − α_year(Y − Y_0) − α_param·log(N/N_0)] + exp[β_const − β_year(Y − Y_0) − β_data·log(D/D_0)]  (8)
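
The headline figure quoted in the 'Research Type' and 'Open Datasets' rows (compute requirements halving roughly every 8 months) converts to an annual effective-compute multiplier by simple arithmetic. A quick sketch; the 8-month estimate and the 2-22 month interval come from the abstract, while the 24-month Moore's Law doubling period is the conventional reference point, not a number from the paper:

```python
# Convert a compute-halving (or doubling) period in months into an
# annual multiplier: halving the required compute every h months is a
# 2**(12/h) gain in effective compute per year of algorithmic progress.
def annual_multiplier(period_months: float) -> float:
    return 2 ** (12 / period_months)

# 8 months is the paper's central estimate; 2 and 22 months bound the 90% CI.
for months in (2, 8, 22):
    print(f"{months:>2}-month halving -> {annual_multiplier(months):.1f}x per year")

# Moore's Law (doubling roughly every 24 months) for comparison.
print(f"Moore's Law (24-month doubling) -> {annual_multiplier(24):.2f}x per year")
```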
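
The 'Experiment Setup' row gives the core model, equation (8). Below is a minimal sketch of fitting a model of that form by nonlinear least squares; the synthetic data, the starting values, and the use of scipy.optimize.curve_fit are illustrative assumptions, not the authors' actual pipeline (their code is at the repository linked above):

```python
import numpy as np
from scipy.optimize import curve_fit

# Equation (8): loss as a sum of two exponential terms, each decaying with
# publication year Y, parameter count N, and dataset size D.
# N0, D0, Y0 stand in for the normalization constants described in the row above.
N0, D0, Y0 = 1e6, 1e6, 2012.0

def loss_model(X, a_const, a_year, a_param, b_const, b_year, b_data):
    Y, N, D = X
    term1 = np.exp(a_const - a_year * (Y - Y0) - a_param * np.log(N / N0))
    term2 = np.exp(b_const - b_year * (Y - Y0) - b_data * np.log(D / D0))
    return term1 + term2

# Synthetic stand-in data (year, parameters, tokens, loss); the real dataset
# has 200+ evaluations on Wikitext and Penn Treebank.
rng = np.random.default_rng(0)
years = rng.uniform(2012, 2023, size=200)
params = 10 ** rng.uniform(6, 11, size=200)
tokens = 10 ** rng.uniform(6, 11, size=200)
true_theta = (2.0, 0.1, 0.3, 2.0, 0.1, 0.3)
losses = loss_model((years, params, tokens), *true_theta)
losses *= np.exp(rng.normal(0, 0.05, size=200))  # multiplicative noise

theta, _ = curve_fit(loss_model, (years, params, tokens), losses,
                     p0=(1.0, 0.05, 0.2, 1.0, 0.05, 0.2), maxfev=20000)
print(dict(zip(["a_const", "a_year", "a_param",
                "b_const", "b_year", "b_data"], theta)))
```

In the paper, the constant terms (the reparameterized A and B) are estimated separately per benchmark while the remaining parameters are shared across benchmarks; the sketch collapses that distinction for brevity.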
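
The 'Dataset Splits' row describes choosing among roughly 90 model specifications by leave-one-out cross-validation on out-of-sample error. Below is a schematic of that selection loop, continuing from the previous sketch (it reuses loss_model and the synthetic arrays defined there); the candidate list is a placeholder, not the authors' specification set:

```python
import numpy as np
from scipy.optimize import curve_fit

def loo_error(model_fn, p0, X, y):
    """Leave-one-out cross-validation: refit on all-but-one observation,
    score the held-out point, and return the mean squared error."""
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        X_train = tuple(col[mask] for col in X)
        try:
            theta, _ = curve_fit(model_fn, X_train, y[mask], p0=p0, maxfev=20000)
        except RuntimeError:
            continue  # skip folds where the optimizer fails to converge
        pred = model_fn(tuple(col[~mask] for col in X), *theta)
        errors.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errors))

# A real run would loop over ~90 candidate specifications; here the single
# placeholder candidate is the two-term model from the previous sketch.
candidates = {"two-term model (eq. 8)": (loss_model, (1.0, 0.05, 0.2, 1.0, 0.05, 0.2))}
scores = {name: loo_error(fn, p0, (years, params, tokens), losses)
          for name, (fn, p0) in candidates.items()}
best = min(scores, key=scores.get)
print("best specification:", best, "| LOO MSE:", scores[best])
```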