Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reconciling Kaplan and Chinchilla Scaling Laws

Authors: Tim Pearce, Jinyeop Song

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Section 4 experimentally verifies our analysis by training a set of language models at tiny scale and conducting scaling law analyses under various settings."
Researcher Affiliation | Collaboration | Tim Pearce (Microsoft Research), Jinyeop Song (MIT)
Pseudocode | No | The paper includes Figure 1 as an "Overview of the approach used to reconcile the two studies," which is a flowchart, but there are no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, code-like steps.
Open Source Code | Yes | Code for analysis: https://github.com/TeaPearce/Reconciling_Kaplan_Chinchilla_Scaling_Laws
Open Datasets | Yes | "We trained five models of sizes NT ∈ [0.8M, 1.6M, 2.1M, 3.3M, 4.6M] on the BookCorpus dataset."
Dataset Splits | No | The paper mentions using the BookCorpus dataset and describes training token budgets and batch sizes (e.g., total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]), but it does not specify explicit train/validation/test splits for the dataset itself, only the total number of tokens used for training.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions numpy's polyfit in Section 3 but does not provide version numbers for numpy or any other key software dependencies or libraries used in the analysis or experiments.
Experiment Setup | Yes | Models were trained for updates ∈ [4000, 4000, 4000, 8000, 8000], with a batch size of 65,536 tokens per update, for total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]. The best learning rate for each model size was chosen from [0.001, 0.005, 0.01, 0.05], and no annealing was applied.
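The non-embedding vs. total parameter distinction that the Research Type entry attributes the Kaplan/Chinchilla discrepancy to can be illustrated with a small sketch. All dimensions below are hypothetical and chosen for illustration; they are not taken from the paper, and the counting formula is the usual rough approximation that ignores biases and layer norms:

```python
# Sketch: non-embedding vs. total parameter counts for a decoder-only
# transformer. Dimensions are illustrative, NOT from the paper.

def transformer_params(d_model, n_layers, vocab_size, d_ff_mult=4):
    """Return (non_embedding, total) parameter counts, roughly:
    ignores biases, layer norms, and positional embeddings."""
    # Per layer: attention projections (4 * d_model^2) plus a two-matrix
    # MLP with hidden width d_ff_mult * d_model (2 * d_ff_mult * d_model^2).
    per_layer = 4 * d_model**2 + 2 * d_ff_mult * d_model**2
    non_embedding = n_layers * per_layer
    embedding = vocab_size * d_model  # token embedding (output head tied)
    return non_embedding, non_embedding + embedding

# Hypothetical tiny model with a GPT-2-sized vocabulary.
non_emb, total = transformer_params(d_model=256, n_layers=4, vocab_size=50257)
print(f"non-embedding: {non_emb/1e6:.1f}M, total: {total/1e6:.1f}M")
```

At this tiny scale the embedding matrix dominates the total count, so which count appears on the x-axis of a scaling-law fit materially changes the fitted coefficients, which is the bias the paper analyzes.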
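The Software Dependencies entry notes that the paper's scaling-law fits use numpy's polyfit. A minimal sketch of that kind of fit, a power law fitted by linear regression in log-log space, is below; the loss values are synthetic and the exponent is chosen only for illustration:

```python
import numpy as np

# Sketch: fit L(N) = c * N**(-alpha) by regressing log L on log N,
# the kind of fit numpy's polyfit enables. Losses here are synthetic.
N = np.array([0.8e6, 1.6e6, 2.1e6, 3.3e6, 4.6e6])  # model sizes from the paper
alpha_true, c_true = 0.076, 20.0                    # illustrative constants
L = c_true * N ** (-alpha_true)                     # noiseless synthetic losses

# In log space the power law is linear: log L = log c - alpha * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha_hat, c_hat = -slope, np.exp(intercept)
print(alpha_hat, c_hat)  # recovers alpha and c on clean data
```

On noiseless data the degree-1 fit recovers the exponent exactly; with real losses, the fitted coefficients depend on the range of N used, which is why performing the analysis only at small scale can bias them.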