A Dynamical Model of Neural Scaling Laws

Authors: Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, our model makes a prediction about why performance scales with training time and with model size with different power-law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule in which the number of training steps is increased faster than the number of model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate 1/width but at late time exhibit a rate width^{-c}, where c depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data. We demonstrate scaling laws on simple vision and language tasks in Figure 1. We empirically study this phenomenon in Appendix L. (A worked sketch of the implied compute-optimal allocation appears after this table.)
Researcher Affiliation | Academia | Blake Bordelon (1,2), Alexander Atanasov (3,2), Cengiz Pehlevan (1,2). Affiliations: 1 SEAS, Harvard University; 2 Kempner Institute, Harvard University; 3 Department of Physics, Harvard University.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We take the CIFAR-5M dataset introduced in (Nakkiran et al., 2021a) and consider the task of classifying animate vs. inanimate objects. Transformer training on WikiText with 100M tokens before data repetition. (A sketch of the binary relabeling appears after this table.)
Dataset Splits | No | The paper mentions using the CIFAR-5M and WikiText datasets for evaluation but does not provide specific training, validation, or test split percentages or counts. It refers only to 'Test Dynamics' and 'Train and Test' without detailing the partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It mentions training models but not the computational resources.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | We train several networks with different widths and initialization seeds for 64 epochs through the dataset. (A training-sweep sketch appears after this table.)
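
The asymmetric compute-optimal rule quoted in the Research Type row can be illustrated with a standard bottleneck argument. The following is a minimal sketch, assuming the test loss decomposes additively into a model-size term with exponent alpha and a training-time term with exponent beta, and that compute is proportional to parameters times steps; the symbols a, b, alpha, beta and the additive ansatz are illustrative assumptions, not the paper's exact parameterization.

\[
\mathcal{L}(N, t) \approx a\, N^{-\alpha} + b\, t^{-\beta}, \qquad C \propto N t .
\]

Minimizing the loss over N and t at fixed compute C gives

\[
N^{\star} \propto C^{\beta/(\alpha+\beta)}, \qquad t^{\star} \propto C^{\alpha/(\alpha+\beta)} ,
\]

so when the training-time exponent is smaller than the model-size exponent (beta < alpha), the optimal number of steps grows faster with compute than the optimal parameter count, which is the asymmetry described above.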
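
The animate-vs-inanimate task quoted in the Open Datasets row is a binary relabeling of CIFAR-5M. Below is a minimal sketch of that relabeling, assuming CIFAR-5M uses the CIFAR-10 class ordering (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck); the choice of which indices count as animate, the file name, and the function name are illustrative assumptions, not the paper's code.

import numpy as np

# Assumed CIFAR-10-style label ordering for CIFAR-5M:
# 0 airplane, 1 automobile, 2 bird, 3 cat, 4 deer,
# 5 dog, 6 frog, 7 horse, 8 ship, 9 truck
ANIMATE_CLASSES = [2, 3, 4, 5, 6, 7]  # assumed grouping for illustration

def to_animate_labels(labels: np.ndarray) -> np.ndarray:
    # Map 10-way labels to binary targets: animate -> 1, inanimate -> 0.
    return np.isin(labels, ANIMATE_CLASSES).astype(np.int64)

# Hypothetical usage with one CIFAR-5M shard:
# labels = np.load("cifar5m_part0_Y.npy")
# binary_labels = to_animate_labels(labels)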
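
The Experiment Setup row describes training several networks of different widths and initialization seeds for 64 epochs. The loop below is a minimal PyTorch-style sketch of such a sweep; the width values, the number of seeds, the MLP architecture, the optimizer settings, and the data loader are all illustrative assumptions rather than the authors' configuration.

import torch
import torch.nn as nn

WIDTHS = [64, 128, 256, 512]  # assumed sweep values for illustration
SEEDS = [0, 1, 2]             # assumed number of initialization seeds
EPOCHS = 64                   # stated in the paper's setup

def make_model(width: int) -> nn.Module:
    # Illustrative architecture only; the paper's networks may differ.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 32 * 32, width),
        nn.ReLU(),
        nn.Linear(width, 1),
    )

def run_sweep(train_loader):
    # Train one model per (width, seed) pair on binary labels.
    models = {}
    loss_fn = nn.BCEWithLogitsLoss()
    for width in WIDTHS:
        for seed in SEEDS:
            torch.manual_seed(seed)
            model = make_model(width)
            opt = torch.optim.SGD(model.parameters(), lr=0.1)
            for _ in range(EPOCHS):
                for x, y in train_loader:
                    opt.zero_grad()
                    loss = loss_fn(model(x).squeeze(-1), y.float())
                    loss.backward()
                    opt.step()
            models[(width, seed)] = model
    return models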