Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reconciling Kaplan and Chinchilla Scaling Laws
Authors: Tim Pearce, Jinyeop Song
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Section 4 experimentally verifies our analysis by training a set of language models at tiny scale and conducting scaling law analyses under various settings. |
| Researcher Affiliation | Collaboration | Tim Pearce Microsoft Research Jinyeop Song MIT |
| Pseudocode | No | The paper includes Figure 1 as an "Overview of the approach used to reconcile the two studies." which is a flowchart, but no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps in a code-like format. |
| Open Source Code | Yes | Code for analysis: https://github.com/TeaPearce/Reconciling_Kaplan_Chinchilla_Scaling_Laws |
| Open Datasets | Yes | We trained five models of sizes, N_T ∈ [0.8M, 1.6M, 2.1M, 3.3M, 4.6M] on the BookCorpus dataset. |
| Dataset Splits | No | The paper mentions using the "BookCorpus dataset" and describes training token budgets and batch sizes (e.g., "total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]") for model training. However, it does not specify explicit train/test/validation splits for the dataset itself, only the total number of tokens used for training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions "numpy's polyfit" in Section 3, but does not provide specific version numbers for numpy or any other key software dependencies or libraries used in the analysis or experiments. |
| Experiment Setup | Yes | Models were trained for updates ∈ [4000, 4000, 4000, 8000, 8000], batchsize was 65,536 tokens per update, for total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]. The best learning rate for each model size was chosen ∈ [0.001, 0.005, 0.01, 0.05] and no annealing was applied. |
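
The token budgets quoted in the Experiment Setup row follow directly from the number of updates and the batch size. The short Python sketch below reproduces that arithmetic as a sanity check. It is not taken from the authors' repository, and the variable and function names (`TOKENS_PER_UPDATE`, `total_tokens`, and so on) are illustrative assumptions; only the numbers come from the row above.

```python
# Sanity check of the token budgets quoted in the "Experiment Setup" row.
# Names are illustrative assumptions, not identifiers from the authors' code.

TOKENS_PER_UPDATE = 65_536                       # batch size in tokens, as quoted
updates = [4000, 4000, 4000, 8000, 8000]         # updates per model, as quoted
model_sizes = ["0.8M", "1.6M", "2.1M", "3.3M", "4.6M"]  # model sizes, as quoted


def total_tokens(n_updates, tokens_per_update=TOKENS_PER_UPDATE):
    """Total training tokens D = number of updates x tokens per update."""
    return n_updates * tokens_per_update


budgets = [total_tokens(u) for u in updates]

for size, u, d in zip(model_sizes, updates, budgets):
    # Report the exact count and the rounded value in millions.
    print(f"{size:>5} params: {u} updates x {TOKENS_PER_UPDATE:,} tokens "
          f"= {d:,} tokens (~{d / 1e6:.0f}M)")
```

The exact products are 262,144,000 and 524,288,000 tokens, which round to the 262M and 524M figures reported in the table.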