Algorithmic progress in language models
Authors: Wing Hin Anson Ho, Tamay Besiroglu, Ege Erdil, Zifan Guo, David Owen, Robi Rahman, David Atkinson, Neil Thompson, Jaime Sevilla
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 90% confidence interval of around 2 to 22 months, substantially faster than hardware gains per Moore's Law. |
| Researcher Affiliation | Collaboration | Anson Ho¹, Tamay Besiroglu¹,², Ege Erdil¹, David Owen¹, Robi Rahman¹, Zifan Carl Guo², David Atkinson¹,³, Neil Thompson², Jaime Sevilla¹ (¹Epoch; ²MIT Future Tech, CSAIL; ³Northeastern University). |
| Pseudocode | No | The paper describes mathematical models and estimation approaches but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | You can find our code and data here: https://github.com/epoch-research/lm-algorithmic-progress. |
| Open Datasets | Yes | Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 90% confidence interval of around 2 to 22 months, substantially faster than hardware gains per Moore's Law. |
| Dataset Splits | Yes | We perform extensive cross-validation exercises to identify the variant of the model that fits the data best. We evaluate around 90 different model specifications through leave-one-out-cross validation and pick the models that perform best on relevant out-of-sample metrics, see Appendix J for more details. |
| Hardware Specification | No | The paper states 'Our empirical results are obtained by fitting a dataset in Google Colab notebooks,' which names a computing environment but gives no specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers used for its analysis, beyond general mentions like 'Google Colab notebooks'. |
| Experiment Setup | Yes | The core model that we use was chosen based on leave-one-out cross validation, and is defined similarly to equation 3 but with a few modifications. The most important change is that A and B are estimated separately for each benchmark, whereas all other parameters are benchmark-agnostic. In order to help with model fitting, we normalize N and D to some minimum N0 and D0 values in our dataset, and reparameterize A and B as exponentials. In full, our model is $L = \exp[\alpha_{\text{const}} - \alpha_{\text{year}}(Y - Y_0) - \alpha_{\text{param}}\log(N/N_0)] + \exp[\beta_{\text{const}} - \beta_{\text{year}}(Y - Y_0) - \beta_{\text{data}}\log(D/D_0)]$ (Equation 8). |
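
The halving-time estimate quoted in the Research Type and Open Datasets rows implies a specific annualized growth rate in effective compute. The sketch below converts a halving time in months into a per-year multiplier and compares it against a roughly two-year Moore's-law doubling; the helper function and the printed comparison are our own illustration, not output from the paper's code.

```python
def annual_gain(halving_time_months: float) -> float:
    """Effective-compute multiplier per year implied by a given halving time.

    If the compute needed to reach a fixed performance level halves every
    h months, effective compute grows by a factor of 2 ** (12 / h) per year.
    """
    return 2 ** (12.0 / halving_time_months)

# Central estimate and 90% CI endpoints from the abstract (in months).
for h in (2, 8, 22):
    print(f"halving time {h:>2} months -> ~{annual_gain(h):.1f}x effective compute per year")

# Moore's law benchmark: doubling roughly every 24 months -> ~1.4x per year.
print(f"Moore's law (24-month doubling) -> ~{2 ** (12 / 24):.1f}x per year")
```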
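Equation (8) quoted in the Experiment Setup row is a sum of two exponential terms that shrink with publication year, parameter count, and dataset size. The sketch below fits that functional form to synthetic data with `scipy.optimize.curve_fit`; the arrays `N`, `D`, `Y`, the "true" parameter values, and the initial guess are all made up for illustration, and the paper's actual estimation additionally fits benchmark-specific intercepts and applies the cross-validation procedure described above.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(X, a_const, a_year, a_param, b_const, b_year, b_data):
    """Equation (8): two exponential terms shrinking with year, parameters, and data."""
    log_n, log_d, year = X  # log(N/N0), log(D/D0), and Y - Y0
    return (np.exp(a_const - a_year * year - a_param * log_n)
            + np.exp(b_const - b_year * year - b_data * log_d))

# Hypothetical inputs: parameter counts N, token counts D, publication years Y.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(8, 11, size=40)   # 1e8 .. 1e11 parameters
D = 10 ** rng.uniform(9, 12, size=40)   # 1e9 .. 1e12 tokens
Y = rng.uniform(2012, 2023, size=40)
X = (np.log(N / N.min()), np.log(D / D.min()), Y - Y.min())

true = [1.2, 0.10, 0.34, 1.4, 0.08, 0.28]           # made-up "ground truth"
L = loss_model(X, *true) + rng.normal(0, 0.02, 40)  # noisy synthetic losses

fit, _ = curve_fit(loss_model, X, L, p0=[1, 0.05, 0.3, 1, 0.05, 0.3], maxfev=20000)
print(dict(zip(["a_const", "a_year", "a_param", "b_const", "b_year", "b_data"],
               np.round(fit, 3))))
```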
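The model selection described in the Dataset Splits row relies on leave-one-out cross validation across many candidate specifications. A minimal version of that loop, reusing `loss_model`, `X`, and `L` from the previous sketch, could look like the following; the paper evaluates roughly 90 specifications and additional out-of-sample metrics, which this sketch does not reproduce.

```python
import numpy as np
from scipy.optimize import curve_fit

def loo_mse(model_fn, X, y, p0):
    """Leave-one-out CV: refit on n-1 observations, score the held-out one."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        X_train = tuple(x[keep] for x in X)
        try:
            params, _ = curve_fit(model_fn, X_train, y[keep], p0=p0, maxfev=20000)
        except RuntimeError:
            continue  # skip folds where the fit fails to converge
        X_test = tuple(x[[i]] for x in X)
        errors.append((model_fn(X_test, *params)[0] - y[i]) ** 2)
    return float(np.mean(errors))

# Score one candidate specification; in practice, repeat for each candidate
# model variant and keep the one with the lowest held-out error.
print(loo_mse(loss_model, X, L, p0=[1, 0.05, 0.3, 1, 0.05, 0.3]))
```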