Broken Neural Scaling Laws
Authors: Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a smoothly broken power law functional form (referred to by us as a broken neural scaling law (BNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. ... An extensive empirical evaluation demonstrates that BNSLs accurately model and extrapolate the scaling behaviors for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. |
| Researcher Affiliation | Academia | Ethan Caballero, Mila, McGill University (ethan.victor.caballero@gmail.com, ethan.caballero@mila.quebec); Kshitij Gupta, Mila, University of Montreal; Irina Rish, Mila, University of Montreal; David Krueger, University of Cambridge |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. It describes mathematical equations and experimental setups in text and tables. |
| Open Source Code | Yes | Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws |
| Open Datasets | Yes | Using the scaling laws benchmark of Alabdulmohsin et al. (2022), we evaluate how well various functional forms extrapolate performance on vision tasks as training dataset size increases. ... The downstream tasks are: Birds 200 (Welinder et al., 2010), Caltech101 (Fei-Fei et al., 2004), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions 'black points are points used for fitting a functional form, green points are the held-out points used for evaluating extrapolation of functional form fit to the black points'. While it distinguishes fitting points from held-out points, it does not specify a 'validation' dataset split with percentages or counts. |
| Hardware Specification | Yes | Each experiment was run on a single V100 GPU and each run took less than 2 hours. |
| Software Dependencies | No | The paper mentions the 'SciPy curve-fitting library (Virtanen et al., 2020)', 'matplotlib (Hunter, 2007)', and the 'minGPT implementation (Karpathy, 2020)'. While libraries are named, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For our experiments we train the transformer model using the following set of hyperparameters: D_model = 128; D_MLP = 512; number of heads = 2; number of transformer blocks (i.e. layers) = 1; learning rate = 0.0001; weight decay = 0.1; dropout probability = 0.0; dataset sizes = 144-1008; vocab size = 10 |
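The smoothly broken power law quoted in the table's first row can be sketched in plain Python. The functional form below follows the paper's definition, y = a + b·x^(-c0)·∏_i (1 + (x/d_i)^(1/f_i))^(-c_i·f_i); the single-break parameter values in the demo are illustrative toy numbers, not fitted constants from the paper.

```python
import math

def bnsl(x, a, b, c0, breaks):
    """Smoothly broken power law (BNSL) with n breaks.

    y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    a      : limiting value of the metric as x -> infinity
    b, c0  : scale and log-log slope of the initial power-law segment
    breaks : list of (c_i, d_i, f_i) tuples; break i shifts the log-log
             slope by an additional -c_i around x = d_i, with the
             sharpness of the transition controlled by f_i
    """
    term = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        term *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + term

def local_slope(f, x, eps=1e-4):
    """Numerical log-log slope of f at x via a finite difference."""
    return (math.log(f(x * (1.0 + eps))) - math.log(f(x))) / math.log1p(eps)

# Toy single-break curve: slope -0.5 well before the break at x = 1000,
# bending to slope -1.0 well after it.
curve = lambda x: bnsl(x, a=0.0, b=1.0, c0=0.5, breaks=[(0.5, 1000.0, 0.05)])
print(local_slope(curve, 10.0))  # ≈ -0.5 (pre-break regime)
print(local_slope(curve, 1e6))   # ≈ -1.0 (post-break regime)
```

In practice the paper fits these parameters with SciPy's curve-fitting routines; the pure-Python version here only illustrates how each break smoothly changes the local power-law exponent.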