A Dynamical Model of Neural Scaling Laws

Authors: Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, our model makes a prediction about why performance scales with training time and with model size with different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule in which the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate 1/width but at late time exhibit a rate width^{-c}, where c depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data. We demonstrate scaling laws on simple vision and language tasks in Figure 1. We empirically study this phenomenon in Appendix L. (A worked sketch of the implied compute-optimal allocation appears after this table.)
Researcher Affiliation | Academia | Blake Bordelon (1,2), Alexander Atanasov (3,2), Cengiz Pehlevan (1,2). Affiliations: 1 SEAS, Harvard University; 2 Kempner Institute, Harvard University; 3 Department of Physics, Harvard University.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We take the CIFAR-5M dataset introduced in (Nakkiran et al., 2021a) and consider the task of classifying animate vs. inanimate objects. Transformer training on WikiText with 100M tokens before data repetition. (A sketch of the binary relabeling appears after this table.)
Dataset Splits | No | The paper mentions using the CIFAR-5M and WikiText datasets for evaluation but does not provide specific training, validation, or test split percentages or counts. It refers only to 'Test Dynamics' and 'Train and Test' without detailing the partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It mentions training models but not the computational resources.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | We train several networks with different widths and initialization seeds for 64 epochs through the dataset. (A training-sweep sketch appears after this table.)
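
The asymmetric compute-optimal rule quoted in the Research Type row can be illustrated with a standard bottleneck argument. The following is a minimal sketch, assuming the test loss decomposes additively into a model-size term with exponent alpha and a training-time term with exponent beta, and that compute is proportional to parameters times steps; the symbols a, b, alpha, beta and the additive ansatz are illustrative assumptions, not the paper's exact parameterization.

\[
\mathcal{L}(N, t) \approx a\, N^{-\alpha} + b\, t^{-\beta}, \qquad C \propto N t .
\]

Minimizing the loss over N and t at fixed compute C gives

\[
N^{\star} \propto C^{\beta/(\alpha+\beta)}, \qquad t^{\star} \propto C^{\alpha/(\alpha+\beta)} ,
\]

so when the training-time exponent is smaller than the model-size exponent (beta < alpha), the optimal number of steps grows faster with compute than the optimal parameter count, which is the asymmetry described above.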
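
The animate-vs-inanimate task quoted in the Open Datasets row is a binary relabeling of CIFAR-5M. Below is a minimal sketch of that relabeling, assuming CIFAR-5M uses the CIFAR-10 class ordering (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck); the choice of which indices count as animate, the file name, and the function name are illustrative assumptions, not the paper's code.

import numpy as np

# Assumed CIFAR-10-style label ordering for CIFAR-5M:
# 0 airplane, 1 automobile, 2 bird, 3 cat, 4 deer,
# 5 dog, 6 frog, 7 horse, 8 ship, 9 truck
ANIMATE_CLASSES = [2, 3, 4, 5, 6, 7]  # assumed grouping for illustration

def to_animate_labels(labels: np.ndarray) -> np.ndarray:
    # Map 10-way labels to binary targets: animate -> 1, inanimate -> 0.
    return np.isin(labels, ANIMATE_CLASSES).astype(np.int64)

# Hypothetical usage with one CIFAR-5M shard:
# labels = np.load("cifar5m_part0_Y.npy")
# binary_labels = to_animate_labels(labels)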
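
The Experiment Setup row describes training several networks of different widths and initialization seeds for 64 epochs. The loop below is a minimal PyTorch-style sketch of such a sweep; the width values, the number of seeds, the MLP architecture, the optimizer settings, and the data loader are all illustrative assumptions rather than the authors' configuration.

import torch
import torch.nn as nn

WIDTHS = [64, 128, 256, 512]  # assumed sweep values for illustration
SEEDS = [0, 1, 2]             # assumed number of initialization seeds
EPOCHS = 64                   # stated in the paper's setup

def make_model(width: int) -> nn.Module:
    # Illustrative architecture only; the paper's networks may differ.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, width),
        nn.ReLU(),
        nn.Linear(width, 1),
    )

def run_sweep(train_loader):
    # Train one model per (width, seed) pair on binary labels.
    models = {}
    loss_fn = nn.BCEWithLogitsLoss()
    for width in WIDTHS:
        for seed in SEEDS:
            torch.manual_seed(seed)
            model = make_model(width)
            opt = torch.optim.SGD(model.parameters(), lr=0.1)
            for _ in range(EPOCHS):
                for x, y in train_loader:
                    opt.zero_grad()
                    loss = loss_fn(model(x).squeeze(-1), y.float())
                    loss.backward()
                    opt.step()
            models[(width, seed)] = model
    return models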