An empirical analysis of compute-optimal large language model training
Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, Laurent Sifre
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. (A worked sketch of this equal-scaling rule follows the table.) |
| Researcher Affiliation | Industry | Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, Laurent Sifre. Equal contributions. DeepMind. (sborgeaud|amensch|sifre)@deepmind.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The code and the data are proprietary. However, for the scaling methodology we provide clear instructions on how to reproduce the results. |
| Open Datasets | Yes | We train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (Table A1) to account for the increased number of training tokens. The paper uses the same data as Rae et al. [38], which is a proprietary dataset, but also shows results on the open-source C4 dataset [40]. |
| Dataset Splits | No | No explicit train/validation/test splits are provided for the experimental setup. The paper refers to existing benchmarks and states that evaluation details are 'the same as described in [38]', but does not detail the splits itself. |
| Hardware Specification | Yes | All models in this analysis have been trained on TPUv3/TPUv4 [22] with JAX [5] and Haiku [17]. |
| Software Dependencies | No | The paper mentions software like JAX, Haiku, AdamW, Adam, and SentencePiece, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The full set of hyperparameters used to train Chinchilla are given in Table 3. Chinchilla uses the same model architecture and training setup as Gopher with the exception of the differences listed below. [...] Table 3 (excerpt): Model: Chinchilla 70B; Layers: 80; Number Heads: 64; Key/Value Size: 128; d_model: 8,192; Max LR: 1 × 10⁻⁴; Batch Size: 1.5M → 3M. (A hedged configuration sketch follows the table.) |
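
The equal-scaling rule summarized in the Research Type row can be made concrete with a short calculation. The sketch below assumes the FLOPs ≈ 6·N·D approximation used in the paper; the function name and the ~20 tokens-per-parameter anchor (roughly the ratio implied by the paper's compute-optimal estimates) are illustrative choices on our part, since the authors' code is not released.

```python
# Minimal sketch of compute-optimal allocation under the paper's equal-scaling
# finding: N_opt and D_opt both grow as C^0.5, so their ratio stays fixed.
# Assumes FLOPs ~= 6 * N * D; the tokens-per-parameter ratio of ~20 is an
# illustrative assumption, not a value quoted in this excerpt.
import math

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C into parameters N and training tokens D with D/N fixed.

    With C ~= 6 * N * D and D = r * N:
        C = 6 * r * N**2  =>  N = sqrt(C / (6 * r)),  D = r * N.
    Doubling N at fixed r therefore requires doubling D, matching the paper's rule.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a Gopher-scale budget of roughly 5.76e23 FLOPs yields a model close
# to Chinchilla's reported configuration (~70B parameters, ~1.4T tokens).
n, d = compute_optimal_allocation(5.76e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
```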
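
The Experiment Setup row reproduces the headline Chinchilla hyperparameters from the paper's Table 3. As a rough illustration of how those settings could be wired into the JAX stack the paper names, here is a hedged sketch using optax (optax itself is an assumption; only the max LR of 1 × 10⁻⁴ comes from the excerpt above, while the warmup length, total step count, AdamW betas, weight decay, and gradient clipping value are placeholders, not reported values).

```python
# Hedged sketch: plugging the headline Chinchilla learning-rate setting into an
# optax AdamW optimizer. Only MAX_LR is taken from the Table 3 excerpt; every
# other number below is a placeholder assumption for illustration.
import optax

MAX_LR = 1e-4          # "Max LR" from Table 3
WARMUP_STEPS = 1_000   # assumption, not reported in this section
TOTAL_STEPS = 500_000  # assumption, not reported in this section

# Linear warmup followed by cosine decay; the paper describes decaying the
# learning rate by roughly 10x over training, so the end value is MAX_LR / 10.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=MAX_LR,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=MAX_LR / 10.0,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),      # placeholder clipping threshold
    optax.adamw(
        learning_rate=schedule,
        b1=0.9, b2=0.95,                 # placeholder AdamW betas
        weight_decay=1e-4,               # placeholder weight decay
    ),
)
```

The batch-size ramp listed in Table 3 (1.5M → 3M tokens) would be handled in the data pipeline rather than the optimizer, so it is omitted from this sketch.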