Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reconciling Kaplan and Chinchilla Scaling Laws
Authors: Tim Pearce, Jinyeop Song
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Section 4 experimentally verifies the analysis by training a set of language models at tiny scale and conducting scaling law analyses under various settings. |
| Researcher Affiliation | Collaboration | Tim Pearce Microsoft Research Jinyeop Song MIT |
| Pseudocode | No | The paper includes Figure 1, an "Overview of the approach used to reconcile the two studies," which is a flowchart, but it contains no explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, code-like steps. |
| Open Source Code | Yes | Code for analysis: https://github.com/TeaPearce/Reconciling_Kaplan_Chinchilla_Scaling_Laws |
| Open Datasets | Yes | We trained five models of sizes N_T ∈ [0.8M, 1.6M, 2.1M, 3.3M, 4.6M] on the Book Corpus dataset. |
| Dataset Splits | No | The paper mentions using the "Book Corpus dataset" and describes training token budgets and batch sizes (e.g., "total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]") for model training. However, it does not specify explicit train/validation/test splits for the dataset itself, only the total number of tokens used for training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions "numpy's polyfit" in Section 3, but does not provide specific version numbers for numpy or any other key software dependencies or libraries used in the analysis or experiments. |
| Experiment Setup | Yes | Models were trained for [4000, 4000, 4000, 8000, 8000] updates with a batch size of 65,536 tokens per update, for total training tokens D ∈ [262M, 262M, 262M, 524M, 524M]. The best learning rate for each model size was chosen from [0.001, 0.005, 0.01, 0.05], and no annealing was applied. |
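The Research Type row hinges on the distinction between non-embedding and total parameter counts. A minimal sketch of that distinction, using the standard decoder-only transformer estimate of roughly 12·d² non-embedding parameters per layer (the function and its arguments are illustrative, not the paper's exact accounting):

```python
# Hypothetical sketch: non-embedding vs. total parameter counts for a
# decoder-only transformer. Kaplan et al. report non-embedding parameters;
# Chinchilla reports total parameters. The 12 * d_model**2 per-layer
# estimate covers attention (~4 d^2) plus a 4x-expansion MLP (~8 d^2).
def transformer_params(n_layer: int, d_model: int, vocab: int) -> dict:
    non_embedding = n_layer * 12 * d_model ** 2
    embedding = vocab * d_model  # token embedding matrix
    return {"non_embedding": non_embedding,
            "total": non_embedding + embedding}

# Illustrative small-scale model (sizes are assumptions, not from the paper)
counts = transformer_params(n_layer=4, d_model=256, vocab=50257)
print(counts)
```

At tiny scale the embedding matrix dominates (here ~13M of ~16M total parameters), so the two counting conventions diverge sharply, which is why small-scale analysis with non-embedding counts can bias the fitted scaling coefficients.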
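The Software Dependencies row cites "numpy's polyfit" as the fitting tool. A minimal sketch of how a scaling-law exponent is typically fit with it, by linearizing a power law in log-log space (the data below are synthetic, not the paper's results):

```python
# Hypothetical sketch: fitting a power law L(N) = c * N**alpha with
# numpy.polyfit. Taking logs gives log L = alpha * log N + log c,
# a straight line that polyfit can recover with deg=1.
import numpy as np

# Synthetic (parameter count, loss) pairs at the paper's model sizes,
# generated from an assumed exponent for illustration only.
N = np.array([0.8e6, 1.6e6, 2.1e6, 3.3e6, 4.6e6])
L = 2.0 * N ** -0.05

alpha, log_c = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"fitted exponent alpha = {alpha:.4f}")  # recovers -0.05 on clean data
print(f"fitted constant c     = {np.exp(log_c):.4f}")
```

On noisy real losses the recovered exponent carries fitting error, which is one mechanism by which small-scale analyses can yield biased scaling coefficients.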