4+3 Phases of Compute-Optimal Neural Scaling Laws

Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget. We include a colab notebook nano-Chinchilla that reproduces some key results of the paper." From Appendix J (Experimental Results): "To measure the exponents of the scaling law and parameter count, we follow approaches 1 and 2 from [24]." (An illustrative exponent-fitting sketch follows the table.)
Researcher Affiliation | Collaboration | Elliot Paquette (McGill University, elliot.paquette@mcgill.ca); Courtney Paquette (Google DeepMind & McGill University, courtney.paquette@mcgill.ca); Lechao Xiao (Google DeepMind); Jeffrey Pennington (Google DeepMind)
Pseudocode | No | The paper contains mathematical derivations and equations, but no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor any structured code-like procedures.
Open Source Code | Yes | "We include a colab notebook nano-Chinchilla that reproduces some key results of the paper."
Open Datasets | No | "The samples x ∈ R^v and labels b ∈ R^v have power law dependence, whereas the matrix W has entries distributed as N(0, 1/d)." No specific public dataset is used; data is generated synthetically according to the described distributions. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper generates synthetic data for its experiments and does not describe explicit training, validation, or test dataset splits. The terms 'train' and 'test' refer to the model training process rather than data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory used for the experiments. It only mentions flops as a measure of compute, stating that "the compute resources are also well known by the community."
Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers. While a Colab notebook is mentioned, no detailed software environment is specified in the text.
Experiment Setup | Yes | "First, we run SGD for parameter counts d ∈ [200, 300, 400, 600, 800, 1200, 1600, 2400, 3200, 4800, 6400, 9600, 12800]. The SGD learning curves for (α, β) = (0.5, 0.7) with parameters d ∈ [800, 1600, 3200, 6400, 12800] are shown in Fig. 9a." "To solve the minimization problem in (5), we use one-pass SGD with minibatches of size B (independent of d) and constant learning rate γ > 0." (A minimal SGD sketch follows the table.)
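
The synthetic data quoted under Open Datasets can be sketched as follows. This is a minimal illustration, not the paper's exact specification: the component scalings j^(-α) for the samples, j^(-β) for the target coefficients, the normalizations, and all numerical values are assumptions chosen for the example.

```python
import numpy as np

def make_power_law_batch(v, d, W, alpha, beta, batch_size, rng):
    """Illustrative generator for the quoted setup: samples x in R^v with
    power-law component scales, a fixed target vector b in R^v with
    power-law coefficients, and random features x @ W where W has
    N(0, 1/d) entries. The j**(-alpha) and j**(-beta) forms are assumptions."""
    j = np.arange(1, v + 1)
    x = rng.standard_normal((batch_size, v)) * j[None, :] ** (-alpha)  # power-law samples
    b = j ** (-beta)                                                   # power-law target vector
    features = x @ W        # model inputs: random features
    targets = x @ b         # regression labels
    return features, targets

rng = np.random.default_rng(0)
v, d = 2000, 400
W = rng.standard_normal((v, d)) / np.sqrt(d)   # entries distributed as N(0, 1/d), as quoted
X, y = make_power_law_batch(v, d, W, alpha=0.5, beta=0.7, batch_size=32, rng=rng)
```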
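
For the Experiment Setup row, the sketch below runs one-pass SGD with a constant learning rate γ and minibatch size B over a few of the quoted parameter counts d. The loss form, the normalizations, and the hyperparameter values (v, γ, B, number of steps) are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def one_pass_sgd(v, d, alpha, beta, gamma, B, n_steps, seed=0):
    """One-pass SGD on a power-law random-features least-squares problem:
    a fresh minibatch of size B at every step, constant learning rate gamma.
    Normalizations and loss form are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1)
    b = j ** (-beta)                                   # power-law target coefficients
    W = rng.standard_normal((v, d)) / np.sqrt(d)       # entries ~ N(0, 1/d)
    theta = np.zeros(d)
    losses = []
    for _ in range(n_steps):
        x = rng.standard_normal((B, v)) * j[None, :] ** (-alpha)  # fresh samples (one pass)
        feats = x @ W
        resid = feats @ theta - x @ b
        theta -= gamma * (feats.T @ resid) / B         # constant-step SGD update
        losses.append(0.5 * np.mean(resid ** 2))
    return np.array(losses)

# Sweep a few of the quoted parameter counts at (alpha, beta) = (0.5, 0.7).
curves = {d: one_pass_sgd(v=2000, d=d, alpha=0.5, beta=0.7, gamma=0.1, B=16, n_steps=2000)
          for d in [200, 400, 800]}
for d, curve in curves.items():
    print(f"d={d:4d}  final loss ~ {curve[-1]:.4f}")
```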
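
For the exponent measurement mentioned under Research Type, one common way to extract a compute-optimal scaling exponent from such runs is to take, at each flops budget, the best loss across the sweep of parameter counts and fit a power law to that frontier. The cost model (flops proportional to B·d per SGD step) and the binning below are assumptions; this is an illustrative envelope-style fit, not the paper's procedure.

```python
import numpy as np

def compute_optimal_exponent(curves, flops_per_step, n_bins=30):
    """curves: dict mapping parameter count d -> per-step loss array.
    flops_per_step(d): assumed cost of one SGD step at width d.
    Fits log(best loss) against log(flops) on the compute-optimal frontier."""
    pts = []
    for d, losses in curves.items():
        steps = np.arange(1, len(losses) + 1)
        pts.append(np.column_stack([steps * flops_per_step(d), losses]))
    pts = np.concatenate(pts)

    # Frontier: bin by log-flops and keep the minimum loss over all d in each bin.
    bins = np.logspace(np.log10(pts[:, 0].min()), np.log10(pts[:, 0].max()), n_bins)
    idx = np.digitize(pts[:, 0], bins)
    frontier = np.array([[pts[idx == k, 0].mean(), pts[idx == k, 1].min()]
                         for k in np.unique(idx)])

    # Power-law fit: log loss ~ slope * log flops + const.
    slope, _ = np.polyfit(np.log(frontier[:, 0]), np.log(frontier[:, 1]), 1)
    return slope

# Example, reusing the `curves` dict from the SGD sketch above and an
# assumed cost of B * d flops per step (B = 16 there).
exponent = compute_optimal_exponent(curves, flops_per_step=lambda d: 16 * d)
print(f"fitted compute-optimal loss exponent ~ {exponent:.3f}")
```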