4+3 Phases of Compute-Optimal Neural Scaling Laws

Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget. We include a colab notebook nano-Chinchilla that reproduces some key results of the paper." From Appendix J (Experimental Results): "To measure the exponents of the scaling law and parameter count, we follow approaches 1 and 2 from [24]." (An illustrative exponent-fitting sketch follows the table.)
Researcher Affiliation | Collaboration | Elliot Paquette (McGill University, elliot.paquette@mcgill.ca); Courtney Paquette (Google DeepMind & McGill University, courtney.paquette@mcgill.ca); Lechao Xiao (Google DeepMind); Jeffrey Pennington (Google DeepMind)
Pseudocode | No | The paper contains mathematical derivations and equations, but no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor any structured code-like procedures.
Open Source Code | Yes | "We include a colab notebook nano-Chinchilla that reproduces some key results of the paper."
Open Datasets | No | "The samples x ∈ R^v and labels b ∈ R^v have power law dependence, whereas the matrix W has entries distributed as N(0, 1/d)." No specific public dataset is used; data is generated synthetically according to the described distributions. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper generates synthetic data for its experiments and does not describe explicit training, validation, or test dataset splits. The terms 'train' and 'test' refer to the model training process rather than data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory used for the experiments. It only mentions flops as a measure of compute, stating that "the compute resources are also well known by the community."
Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers. While a Colab notebook is mentioned, no detailed software environment is specified in the text.
Experiment Setup | Yes | "First, we run SGD for parameter counts d ∈ [200, 300, 400, 600, 800, 1200, 1600, 2400, 3200, 4800, 6400, 9600, 12800]. The SGD learning curves for (α, β) = (0.5, 0.7) with parameters d ∈ [800, 1600, 3200, 6400, 12800] are shown in Fig. 9a." "To solve the minimization problem in (5), we use one-pass SGD with minibatches of size B (independent of d) and constant learning rate γ > 0." (A minimal SGD sketch follows the table.)
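
The synthetic data quoted under Open Datasets can be sketched as follows. This is a minimal illustration, not the paper's exact specification: the component scalings j^(-α) for the samples, j^(-β) for the target coefficients, the normalizations, and all numerical values are assumptions chosen for the example.

```python
import numpy as np

def make_power_law_batch(v, d, W, alpha, beta, batch_size, rng):
    """Illustrative generator for the quoted setup: samples x in R^v with
    power-law component scales, a fixed target vector b in R^v with
    power-law coefficients, and random features x @ W where W has
    N(0, 1/d) entries. The j**(-alpha) and j**(-beta) forms are assumptions."""
    j = np.arange(1, v + 1)
    x = rng.standard_normal((batch_size, v)) * j[None, :] ** (-alpha)  # power-law samples
    b = j ** (-beta)                                                   # power-law target vector
    features = x @ W        # model inputs: random features
    targets = x @ b         # regression labels
    return features, targets

rng = np.random.default_rng(0)
v, d = 2000, 400
W = rng.standard_normal((v, d)) / np.sqrt(d)   # entries distributed as N(0, 1/d), as quoted
X, y = make_power_law_batch(v, d, W, alpha=0.5, beta=0.7, batch_size=32, rng=rng)
```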
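
For the Experiment Setup row, the sketch below runs one-pass SGD with a constant learning rate γ and minibatch size B over a few of the quoted parameter counts d. The loss form, the normalizations, and the hyperparameter values (v, γ, B, number of steps) are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def one_pass_sgd(v, d, alpha, beta, gamma, B, n_steps, seed=0):
    """One-pass SGD on a power-law random-features least-squares problem:
    a fresh minibatch of size B at every step, constant learning rate gamma.
    Normalizations and loss form are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1)
    b = j ** (-beta)                                   # power-law target coefficients
    W = rng.standard_normal((v, d)) / np.sqrt(d)       # entries ~ N(0, 1/d)
    theta = np.zeros(d)
    losses = []
    for _ in range(n_steps):
        x = rng.standard_normal((B, v)) * j[None, :] ** (-alpha)  # fresh samples (one pass)
        feats = x @ W
        resid = feats @ theta - x @ b
        theta -= gamma * (feats.T @ resid) / B         # constant-step SGD update
        losses.append(0.5 * np.mean(resid ** 2))
    return np.array(losses)

# Sweep a few of the quoted parameter counts at (alpha, beta) = (0.5, 0.7).
curves = {d: one_pass_sgd(v=2000, d=d, alpha=0.5, beta=0.7, gamma=0.1, B=16, n_steps=2000)
          for d in [200, 400, 800]}
for d, curve in curves.items():
    print(f"d={d:4d}  final loss ~ {curve[-1]:.4f}")
```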
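
For the exponent measurement mentioned under Research Type, one common way to extract a compute-optimal scaling exponent from such runs is to take, at each flops budget, the best loss across the sweep of parameter counts and fit a power law to that frontier. The cost model (flops proportional to B·d per SGD step) and the binning below are assumptions; this is an illustrative envelope-style fit, not the paper's procedure.

```python
import numpy as np

def compute_optimal_exponent(curves, flops_per_step, n_bins=30):
    """curves: dict mapping parameter count d -> per-step loss array.
    flops_per_step(d): assumed cost of one SGD step at width d.
    Fits log(best loss) against log(flops) on the compute-optimal frontier."""
    pts = []
    for d, losses in curves.items():
        steps = np.arange(1, len(losses) + 1)
        pts.append(np.column_stack([steps * flops_per_step(d), losses]))
    pts = np.concatenate(pts)

    # Frontier: bin by log-flops and keep the minimum loss over all d in each bin.
    bins = np.logspace(np.log10(pts[:, 0].min()), np.log10(pts[:, 0].max()), n_bins)
    idx = np.digitize(pts[:, 0], bins)
    frontier = np.array([[pts[idx == k, 0].mean(), pts[idx == k, 1].min()]
                         for k in np.unique(idx)])

    # Power-law fit: log loss ~ slope * log flops + const.
    slope, _ = np.polyfit(np.log(frontier[:, 0]), np.log(frontier[:, 1]), 1)
    return slope

# Example, reusing the `curves` dict from the SGD sketch above and an
# assumed cost of B * d flops per step (B = 16 there).
exponent = compute_optimal_exponent(curves, flops_per_step=lambda d: 16 * d)
print(f"fitted compute-optimal loss exponent ~ {exponent:.3f}")
```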