Broken Neural Scaling Laws

Authors: Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger

ICLR 2023

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
"We present a smoothly broken power law functional form (referred to by us as a broken neural scaling law (BNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. ... An extensive empirical evaluation demonstrates that BNSLs accurately model and extrapolate the scaling behaviors for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings."
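For reference, the BNSL functional form can be sketched in a few lines of Python. This follows the smoothly broken power law described in the paper (its Equation 1), with a as the limiting value, b and c0 governing the initial power-law segment, and each break i parameterized by (c_i, d_i, f_i); the example parameter values below are illustrative, not fitted values from the paper.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Smoothly broken power law with one multiplicative factor per break.

    x      -- quantity being scaled (compute, parameters, or dataset size)
    a      -- limit the evaluation metric approaches as x grows
    b, c0  -- scale and exponent of the initial power-law segment
    breaks -- iterable of (c_i, d_i, f_i): c_i shifts the exponent,
              d_i locates the break, f_i sets how sharp the transition is
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# Illustrative only: a single break around x = 1e6.
x = np.logspace(2, 9, 100)
y = bnsl(x, a=0.1, b=2.0, c0=0.3, breaks=[(0.4, 1e6, 0.5)])
```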
Researcher Affiliation: Academia
Ethan Caballero (Mila, McGill University; ethan.victor.caballero@gmail.com, ethan.caballero@mila.quebec); Kshitij Gupta (Mila, University of Montreal); Irina Rish (Mila, University of Montreal); David Krueger (University of Cambridge)
Pseudocode: No
The paper does not include any pseudocode or algorithm blocks. It describes mathematical equations and experimental setups in text and tables.
Open Source Code: Yes
"Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws"
Open Datasets: Yes
"Using the scaling laws benchmark of Alabdulmohsin et al. (2022), we evaluate how well various functional forms extrapolate performance on vision tasks as training dataset size increases. ... The downstream tasks are: Birds 200 (Welinder et al., 2010), Caltech101 (Fei-Fei et al., 2004), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009)."
Dataset Splits: No
The paper notes that "black points are points used for fitting a functional form, green points are the held-out points used for evaluating extrapolation of the functional form fit to the black points." While it distinguishes fitting points from held-out extrapolation points, it does not explicitly define a validation split with percentages or counts.
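The fit-then-extrapolate protocol that quote describes can be sketched as follows. This is a minimal stand-in that fits a plain power law with offset to synthetic points using the SciPy curve-fitting routine the paper cites; the paper itself fits BNSL to the benchmark data rather than this simple form, and the cutoff and values here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Simple power law with offset; a stand-in for the paper's BNSL form.
def power_law(x, a, b, c):
    return a + b * x ** (-c)

# Synthetic scaling points (illustrative, not benchmark data).
x = np.logspace(3, 7, 20)
y = power_law(x, a=0.05, b=5.0, c=0.4) + np.random.normal(0, 1e-3, x.size)

# "Black points": smaller scales used for fitting.
# "Green points": larger, held-out scales used to score extrapolation.
fit_mask = x <= 1e5
params, _ = curve_fit(power_law, x[fit_mask], y[fit_mask],
                      p0=[0.0, 1.0, 0.5], maxfev=10000)

pred = power_law(x[~fit_mask], *params)
rmse = np.sqrt(np.mean((pred - y[~fit_mask]) ** 2))
print(f"held-out extrapolation RMSE: {rmse:.2e}")
```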
Hardware Specification: Yes
"Each experiment was run on a single V100 GPU and each run took less than 2 hours."
Software Dependencies: No
The paper mentions the "SciPy curve-fitting library (Virtanen et al., 2020)", "matplotlib (Hunter, 2007)", and the "minGPT implementation (Karpathy, 2020)". While these libraries are named, specific version numbers are not provided.
Experiment Setup: Yes
"For our experiments we train the transformer model using the following set of hyperparameters:"
  d_model: 128
  d_MLP: 512
  Number of heads: 2
  Number of transformer blocks (i.e. layers): 1
  Learning rate: 0.0001
  Weight decay: 0.1
  Dropout probability: 0.0
  Dataset sizes: 144-1008
  Vocab size: 10
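As a rough illustration, the reported hyperparameters map onto a standard PyTorch transformer as below. This is a hedged sketch, not the paper's minGPT-based training code: the paper's model is GPT-style (causally masked), whereas nn.TransformerEncoderLayer serves here purely as a stand-in, and pairing the listed learning rate and weight decay with AdamW is an assumption.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the paper's experiment setup.
D_MODEL, D_MLP, N_HEADS, N_LAYERS = 128, 512, 2, 1
VOCAB_SIZE, DROPOUT, LR, WEIGHT_DECAY = 10, 0.0, 1e-4, 0.1

# Token embedding -> 1 transformer block -> linear readout over the vocab.
# NOTE: a GPT-style model would add a causal attention mask; omitted here.
model = nn.ModuleDict({
    "embed": nn.Embedding(VOCAB_SIZE, D_MODEL),
    "encoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS,
            dim_feedforward=D_MLP, dropout=DROPOUT, batch_first=True,
        ),
        num_layers=N_LAYERS,
    ),
    "head": nn.Linear(D_MODEL, VOCAB_SIZE),
})

# AdamW pairing is an assumption; the paper only lists lr and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

# Smoke test: batch of 4 sequences of length 16 drawn from the size-10 vocab.
tokens = torch.randint(0, VOCAB_SIZE, (4, 16))
logits = model["head"](model["encoder"](model["embed"](tokens)))
```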