4+3 Phases of Compute-Optimal Neural Scaling Laws
Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget. We include a Colab notebook, nanoChinchilla, that reproduces some key results of the paper. (Appendix J, Experimental Results) To measure the exponents of the scaling law and parameter count, we follow approaches 1 and 2 from [24]. |
| Researcher Affiliation | Collaboration | Elliot Paquette (McGill University, elliot.paquette@mcgill.ca); Courtney Paquette (Google DeepMind & McGill University, courtney.paquette@mcgill.ca); Lechao Xiao (Google DeepMind); Jeffrey Pennington (Google DeepMind) |
| Pseudocode | No | The paper contains mathematical derivations and equations, but no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor any structured code-like procedures. |
| Open Source Code | Yes | We include a Colab notebook, nanoChinchilla, that reproduces some key results of the paper. |
| Open Datasets | No | The samples x ∈ R^v and labels b ∈ R^v have power-law dependence, whereas the matrix W has entries distributed as N(0, 1/d). No specific public dataset is used; data is synthetically generated according to the described distributions. |
| Dataset Splits | No | The paper generates synthetic data for its experiments and does not describe explicit training, validation, or test dataset splits. The terms 'train' and 'test' refer to the model training process rather than data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory used for running the experiments. It only mentions 'flops' as a measure of compute, stating 'the compute resources are also well known by the community.' |
| Software Dependencies | No | The paper does not explicitly provide a list of software dependencies with specific version numbers. While a Colab notebook is mentioned, no detailed software environment is specified in the text. |
| Experiment Setup | Yes | First, we run SGD for parameter counts d ∈ {200, 300, 400, 600, 800, 1200, 1600, 2400, 3200, 4800, 6400, 9600, 12800}. The SGD learning curves for (α, β) = (0.5, 0.7) with parameters d ∈ {800, 1600, 3200, 6400, 12800} are shown in Fig. 9a. To solve the minimization problem in (5), we use one-pass SGD with minibatches of size B (independent of d) and constant learning rate γ > 0. |
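
The Open Datasets row states that the data are fully synthetic: samples x ∈ R^v and labels with power-law dependence, and a random-features matrix W with N(0, 1/d) entries. Below is a minimal sketch of such a generator, assuming feature j has standard deviation j^(-α) and target coefficient j^(-β); the exact construction is specified in the paper and the nanoChinchilla notebook, so the parameterization here is an assumption.

```python
import numpy as np

def make_plrf_problem(v, d, alpha, beta, seed=0):
    """Synthetic power-law random-features problem (assumed parameterization).

    Assumed, not taken from the table: feature j of x has standard deviation
    j**(-alpha) and the target weight on feature j is j**(-beta).
    Stated in the paper: W has i.i.d. N(0, 1/d) entries.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))      # entries N(0, 1/d)
    feature_scales = np.arange(1, v + 1) ** (-float(alpha))  # power-law feature scales
    b = np.arange(1, v + 1) ** (-float(beta))                 # power-law target coefficients

    def sample_batch(batch_size):
        # Fresh samples on every call: one-pass SGD never reuses data.
        x = rng.normal(size=(batch_size, v)) * feature_scales
        y = x @ b
        return x, y

    return W, sample_batch
```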
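
The Experiment Setup row describes one-pass SGD with minibatches of size B (independent of d) and a constant learning rate γ > 0, run over a grid of parameter counts d. Below is a hedged sketch of that loop using the hypothetical generator above; the learning rate, batch size, ambient dimension v, and step count are illustrative placeholders, not the authors' values.

```python
import numpy as np

def one_pass_sgd(W, sample_batch, gamma, B, steps):
    """One-pass SGD on the squared loss of the linear random-features model x -> x @ W @ theta."""
    v, d = W.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(steps):
        x, y = sample_batch(B)               # fresh minibatch of size B each step
        resid = x @ (W @ theta) - y          # per-example residuals
        grad = W.T @ (x.T @ resid) / B       # gradient of the mean squared loss w.r.t. theta
        theta -= gamma * grad                # constant learning rate gamma > 0
        losses.append(0.5 * float(np.mean(resid ** 2)))
    return np.array(losses)

# Illustrative sweep over a few of the paper's model sizes at (alpha, beta) = (0.5, 0.7);
# gamma, B, v, and the step count are placeholder choices.
curves = {}
for d in [200, 400, 800]:
    W, sample_batch = make_plrf_problem(v=2000, d=d, alpha=0.5, beta=0.7, seed=d)
    curves[d] = one_pass_sgd(W, sample_batch, gamma=0.05, B=16, steps=1000)
```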
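
The Research Type row quotes the paper as following approaches 1 and 2 from [24] to measure the scaling-law and parameter-count exponents. The sketch below is an Approach-1-style fit under the assumption that flops per SGD step scale like d·B: for each flop budget, take the best loss achievable across model sizes, record the minimizing d, and regress log d* on log compute. This illustrates the general recipe, not the authors' exact measurement code; the example at the end reuses the curves from the sweep above.

```python
import numpy as np

def fit_compute_optimal_exponent(curves, B, num_budgets=25):
    """Approach-1-style fit (sketch): for each flop budget, take the model size d
    whose learning curve attains the lowest loss within the budget, then fit
    d_star ~ C**a by least squares in log-log coordinates.

    `curves` maps d -> per-step loss array; the flop proxy d * B * step is an
    assumption, not the paper's exact flop accounting.
    """
    step_flops = {d: d * B * np.arange(1, len(loss) + 1) for d, loss in curves.items()}
    lo = max(f[0] for f in step_flops.values())    # smallest budget every curve has reached
    hi = min(f[-1] for f in step_flops.values())   # largest budget every curve has reached
    budgets = np.logspace(np.log10(lo), np.log10(hi), num_budgets)

    best_d = []
    for C in budgets:
        # lowest loss each model size attains within budget C
        loss_at_C = {d: curves[d][step_flops[d] <= C].min() for d in curves}
        best_d.append(min(loss_at_C, key=loss_at_C.get))

    a, _ = np.polyfit(np.log(budgets), np.log(best_d), 1)
    return a  # estimated exponent a in d_star ~ C**a

print("estimated parameter-count exponent:", fit_compute_optimal_exponent(curves, B=16))
```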