Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

Authors: Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'Furthermore, we train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges (up to 10,000).'
Researcher Affiliation | Industry | Databricks MosaicML, United States of America.
Pseudocode | No | The paper includes mathematical derivations and equations in Appendix A, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using open-source tools like the MPT architecture and the Evaluation Gauntlet, but it does not provide an explicit statement or link to the source code for the methodology developed in this paper.
Open Datasets | No | The paper states: 'Our dataset consists of trillions of tokens of general web text and code.' However, it does not provide any concrete access information (link, DOI, or citation to a public source) for this training dataset. While it evaluates on known public benchmarks, the primary training dataset is not indicated as publicly available.
Dataset Splits | No | The paper states it trains for 'a single epoch' and does not mention explicit training/validation/test splits, specific percentages, or sample counts for data partitioning. While it uses an 'Evaluation Gauntlet', this refers to evaluation tasks, not a dataset split for validation purposes.
Hardware Specification | Yes | Costs are calculated assuming training and inference on A100-80GB and A100-40GB accelerators, respectively (see the cost sketch below the table).
Software Dependencies | No | The paper mentions software components like ALiBi, Grouped Query Attention, and the Lion optimizer, but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Table 4, titled 'Model Training Configurations', provides specific hyperparameters such as learning rate and batch size for different model sizes. It also details other training settings: 'Lion optimizer (β1 = 0.9, β2 = 0.95) with weight decay equal to the learning rate, cosine warmup (αf = 0.1) with a duration equal to 3 times the number of model parameters, and norm gradient clipping (threshold = 1). A maximum sequence length of 4096 tokens was used.' (See the training-configuration sketch below the table.)
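
Since the paper's central quantity is a combined training-plus-inference compute cost, evaluated on the A100 accelerators noted in the Hardware Specification row, a small worked example may help. The sketch below is not the authors' code; it uses the standard approximations of roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per inference token, and the 7B model size, token counts, and utilization figures are illustrative assumptions rather than values reported in the paper.

```python
# A minimal sketch (not the authors' code) of combined training-plus-inference
# cost accounting, using the standard approximations of ~6 FLOPs per parameter
# per training token and ~2 FLOPs per parameter per inference token. The model
# size, token counts, and utilization (MFU) figures are illustrative assumptions.

def train_and_inference_flops(n_params: float, train_tokens: float,
                              inference_tokens: float) -> tuple[float, float]:
    """Return (training FLOPs, inference FLOPs) under the common approximations."""
    train = 6.0 * n_params * train_tokens          # ~6 FLOPs / parameter / training token
    inference = 2.0 * n_params * inference_tokens  # ~2 FLOPs / parameter / inference token
    return train, inference

def gpu_hours(flops: float, peak_flops_per_sec: float, utilization: float) -> float:
    """Convert a FLOP count into accelerator-hours at a given peak rate and utilization."""
    return flops / (peak_flops_per_sec * utilization) / 3600.0

if __name__ == "__main__":
    N = 7e9          # hypothetical 7B-parameter model
    D_TRAIN = 2e12   # hypothetical 2T training tokens
    D_INF = 1e12     # hypothetical 1T lifetime inference tokens
    A100_BF16_PEAK = 312e12  # A100 dense BF16 peak throughput, FLOPs/s

    train_f, inf_f = train_and_inference_flops(N, D_TRAIN, D_INF)
    print(f"training:  {train_f:.3e} FLOPs "
          f"~ {gpu_hours(train_f, A100_BF16_PEAK, 0.40):,.0f} A100-hours at 40% MFU")
    print(f"inference: {inf_f:.3e} FLOPs "
          f"~ {gpu_hours(inf_f, A100_BF16_PEAK, 0.20):,.0f} A100-hours at 20% MFU")
```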
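
The Experiment Setup row quotes concrete optimizer settings; the sketch below shows one way such a configuration could be written, assuming PyTorch and the third-party lion-pytorch package (the paper does not state which implementation it used). Everything not quoted in the table, including the model, learning rate, and step count, is a placeholder, and the size-dependent warmup duration is omitted.

```python
# A minimal sketch (assuming PyTorch plus the third-party `lion-pytorch` package)
# of a training setup matching the hyperparameters quoted above: Lion with
# (beta1, beta2) = (0.9, 0.95), weight decay equal to the learning rate, a cosine
# schedule decaying to alpha_f = 0.1 of the peak rate, and gradient-norm clipping
# at threshold 1. The model, learning rate, and step count are placeholders.
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(4096, 4096)  # stand-in for an MPT-style transformer
lr = 3e-4                            # illustrative; Table 4 lists per-size learning rates
num_steps = 100                      # placeholder step count

optimizer = Lion(
    model.parameters(),
    lr=lr,
    betas=(0.9, 0.95),
    weight_decay=lr,                 # weight decay equal to the learning rate
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_steps, eta_min=0.1 * lr  # decay to alpha_f * peak LR
)

for step in range(num_steps):
    batch = torch.randn(8, 4096)                  # dummy batch; real runs use 4096-token sequences
    loss = model(batch).pow(2).mean()             # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # norm gradient clipping, threshold 1
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```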