Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Authors: Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper states: 'Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000).' (The formula in question is sketched after the table.) |
| Researcher Affiliation | Industry | The affiliation listed is 'Databricks Mosaic ML, United States of America.' |
| Pseudocode | No | The paper includes mathematical derivations and equations in Appendix A, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using open-source tools like the MPT architecture and Evaluation Gauntlet, but it does not provide an explicit statement or link to the source code for the methodology developed in this paper. |
| Open Datasets | No | The paper states: 'Our dataset consists of trillions of tokens of general web text and code.' However, it does not provide any concrete access information (link, DOI, or citation to a public source) for this training dataset. While it evaluates on known public benchmarks, the primary dataset used for training is not indicated as publicly available. |
| Dataset Splits | No | The paper states it trains for 'a single epoch' and does not mention explicit training/validation/test splits, specific percentages, or sample counts for data partitioning. While it uses an 'Evaluation Gauntlet', this refers to evaluation tasks, not a dataset split for validation purposes. |
| Hardware Specification | Yes | Costs are calculated assuming training and inference on A100-80GB and A100-40GB accelerators, respectively. (A hedged FLOP-accounting sketch follows the table.) |
| Software Dependencies | No | The paper mentions software components like 'ALiBi', 'Grouped Query Attention', and 'the Lion optimizer', but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Table 4, titled 'Model Training Configurations', provides specific hyperparameters such as Learning Rate and Batch Size for different model sizes. It also details other training settings like 'Lion optimizer (β1 = 0.9, β2 = 0.95) with weight decay equal to the learning rate, cosine warmup (αf = 0.1) with a duration equal to 3 times the number of model parameters, and norm gradient clipping (threshold = 1). A maximum sequence length of 4096 tokens was used.' (These settings are restated as a configuration sketch below the table.) |
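
For context on the 'formula' mentioned in the Research Type row, the paper's setup can be summarized roughly as follows. This is a paraphrase that combines the standard Chinchilla parametric loss with the paper's inference-aware FLOP objective, not a transcription of its equations; the symbols $N$ (parameters), $D_{\mathrm{tr}}$ (pre-training tokens), $D_{\mathrm{inf}}$ (expected lifetime inference tokens), and $\ell$ (target loss) are the usual scaling-law notation.

```latex
% Paraphrase (not a transcription) of the inference-adjusted objective:
% minimize lifetime compute subject to reaching a target pre-training loss
% under the Chinchilla parametric loss form.
L(N, D_{\mathrm{tr}}) = E + \frac{A}{N^{\alpha}} + \frac{B}{D_{\mathrm{tr}}^{\beta}},
\qquad
\min_{N,\; D_{\mathrm{tr}}} \; 6\,N D_{\mathrm{tr}} + 2\,N D_{\mathrm{inf}}
\quad \text{s.t.} \quad L(N, D_{\mathrm{tr}}) = \ell .
```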
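
The Hardware Specification row names the accelerators but not how costs are derived from them. The sketch below shows one plausible accounting under the common approximations of roughly 6ND FLOPs for training and 2ND FLOPs for inference; the utilization, price, and model-size figures are illustrative placeholders, not values from the paper.

```python
# Hedged sketch of inference-aware compute accounting, in the spirit of the
# paper's cost analysis. Assumes ~6*N*D FLOPs for training (forward + backward)
# and ~2*N*D FLOPs for inference (forward only); all numeric inputs in the
# example are placeholders, not numbers taken from the paper.

def total_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Approximate lifetime FLOPs for a model with n_params parameters."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def gpu_cost(flops: float, peak_flops_per_sec: float, utilization: float,
             dollars_per_gpu_hour: float) -> float:
    """Convert a FLOP budget into an illustrative dollar figure for one accelerator type."""
    gpu_seconds = flops / (peak_flops_per_sec * utilization)
    return gpu_seconds / 3600.0 * dollars_per_gpu_hour

if __name__ == "__main__":
    # Placeholder example: 7B parameters, 2T training tokens, 1T lifetime inference tokens.
    flops = total_flops(7e9, 2e12, 1e12)
    # 312e12 is the A100's peak dense BF16 throughput; utilization and price are made up.
    print(f"total FLOPs ~ {flops:.3e}")
    print(f"illustrative cost ~ ${gpu_cost(flops, 312e12, 0.4, 2.0):,.0f}")
```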
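
The training settings quoted in the Experiment Setup row can be collected into a single configuration sketch. The field names and structure below are illustrative rather than the paper's actual config schema; only the values mirror what the row reports.

```python
# Hedged restatement of the quoted training settings as one config dict.
# Structure and key names are illustrative; values mirror the reported setup.
training_config = {
    "optimizer": {
        "name": "lion",
        "betas": (0.9, 0.95),
        "weight_decay": "set equal to the learning rate",
    },
    "lr_schedule": {
        "type": "cosine with warmup",
        "alpha_f": 0.1,  # final learning-rate fraction, as quoted
        "warmup_duration": "3x the number of model parameters",
    },
    "gradient_clipping": {"type": "norm", "threshold": 1.0},
    "max_seq_len": 4096,
    # Per-model learning rates and batch sizes are listed in the paper's Table 4.
}
```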