Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reconciling Kaplan and Chinchilla Scaling Laws

Authors: Tim Pearce, Jinyeop Song

TMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Section 4 experimentally verifies our analysis by training a set of language models at tiny scale and conducting scaling law analyses under various settings.
Researcher Affiliation	Collaboration	Tim Pearce (Microsoft Research), Jinyeop Song (MIT)
Pseudocode	No	The paper includes Figure 1 as an "Overview of the approach used to reconcile the two studies." which is a flowchart, but no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps in a code-like format.
Open Source Code	Yes	Code for analysis: https://github.com/TeaPearce/Reconciling_Kaplan_Chinchilla_Scaling_Laws
Open Datasets	Yes	We trained five models of sizes N_T ∈ [0.8M, 1.6M, 2.1M, 3.3M, 4.6M] on the BookCorpus dataset.
Dataset Splits	No	The paper mentions using the "BookCorpus dataset" and describes training token budgets and batch sizes (e.g., "total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]") for model training. However, it does not specify explicit train/validation/test splits for the dataset itself, only the total number of tokens used for training.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies	No	The paper mentions "numpy's polyfit" in Section 3, but does not provide specific version numbers for numpy or any other key software dependencies or libraries used in the analysis or experiments.
Experiment Setup	Yes	Models were trained for updates ∈ [4000, 4000, 4000, 8000, 8000], batch size was 65,536 tokens per update, for total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]. The best learning rate for each model size was chosen ∈ [0.001, 0.005, 0.01, 0.05] and no annealing was applied.
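
The token budgets quoted in the Experiment Setup row follow from the update counts and the batch size. A minimal sketch of that arithmetic is given below; the variable and function names are illustrative and are not taken from the paper's code.

```python
# Values quoted in the Experiment Setup row of the table above.
updates_per_model = [4000, 4000, 4000, 8000, 8000]
tokens_per_update = 65_536  # batch size, in tokens per update


def total_training_tokens(updates: int, batch_tokens: int) -> int:
    """Total tokens seen in training: number of updates times tokens per update."""
    return updates * batch_tokens


totals = [total_training_tokens(u, tokens_per_update) for u in updates_per_model]

for u, d in zip(updates_per_model, totals):
    # Report the exact count and the rounded value in millions.
    print(f"{u} updates -> {d:,} tokens (~{d / 1e6:.0f}M)")
```

The exact products are 262,144,000 and 524,288,000 tokens, which round to the 262M and 524M figures reported for D.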