A Dynamical Model of Neural Scaling Laws
Authors: Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, our model makes a prediction about why the scaling of performance with training time and with model size have different power law exponents. Consequently, the theory predicts an asymmetric compute-optimal scaling rule where the number of training steps is increased faster than model parameters, consistent with recent empirical observations. Second, it has been observed that early in training, networks converge to their infinite-width dynamics at a rate 1/width but at late time exhibit a rate width^{-c}, where c depends on the structure of the architecture and task. We show that our model exhibits this behavior. Lastly, our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data. We demonstrate scaling laws on simple vision and language tasks in Figure 1. We empirically study this phenomenon in Appendix L. |
| Researcher Affiliation | Academia | Blake Bordelon 1 2 Alexander Atanasov 3 2 Cengiz Pehlevan 1 2 1SEAS, Harvard University 2Kempner Institute, Harvard University 3Department of Physics, Harvard University. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We take the CIFAR-5M dataset introduced in (Nakkiran et al., 2021a) and consider the task of classifying animate vs inanimate objects. Transformer training on WikiText with 100M tokens before data-repetition. |
| Dataset Splits | No | The paper mentions using CIFAR-5M and Wikitext datasets for evaluation but does not provide specific training, validation, or test split percentages or counts. It only refers to 'Test Dynamics' and 'Train and Test' without detailing the partitioning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It mentions training models but not the computational resources. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication (e.g., 'Python 3.8, PyTorch 1.9'). |
| Experiment Setup | Yes | We train several networks with different widths and initialization seeds for 64 epochs through the dataset. |
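
The quoted responses above center on one quantitative claim: the training-time and model-size scaling of the loss follow power laws with different exponents, which in turn yields the asymmetric compute-optimal rule. The sketch below shows how such exponents could be fit from a width/step sweep like the one described in the experiment setup row. It is a minimal illustration, not the authors' code: the synthetic loss curves, the additive power-law form, and all variable names are assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): fit separate power-law
# exponents for the training-time and model-size scaling of the test loss.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, alpha, c):
    # L(x) ~ a * x**(-alpha) + c, with c an irreducible loss floor.
    return a * np.power(x, -alpha) + c

# Hypothetical loss curves; replace with measured test losses from a width/step sweep.
steps = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
loss_vs_steps = 2.0 * steps ** -0.35 + 0.10      # synthetic stand-in
widths = np.array([64, 128, 256, 512, 1024, 2048])
loss_vs_width = 1.5 * widths ** -0.60 + 0.10     # synthetic stand-in

p_time, _ = curve_fit(power_law, steps, loss_vs_steps, p0=[1.0, 0.3, 0.1])
p_width, _ = curve_fit(power_law, widths, loss_vs_width, p0=[1.0, 0.5, 0.1])
alpha_time, alpha_width = p_time[1], p_width[1]
print(f"time exponent ~ {alpha_time:.2f}, width exponent ~ {alpha_width:.2f}")

# For L ~ a*t^(-alpha_time) + b*N^(-alpha_width) at fixed compute C ~ N*t, the
# compute-optimal allocation grows t faster than N whenever alpha_time < alpha_width,
# which is the asymmetric scaling rule the paper describes.
```

Only NumPy and SciPy are needed to run the sketch; on real sweeps one would restrict the fit to the tail of each curve, where the power-law regime is expected to hold.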