Broken Neural Scaling Laws
Authors: Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a smoothly broken power law functional form (referred to by us as a broken neural scaling law (BNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. ... An extensive empirical evaluation demonstrates that BNSLs accurately model and extrapolate the scaling behaviors for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. |
| Researcher Affiliation | Academia | Ethan Caballero, Mila, McGill University (ethan.victor.caballero@gmail.com, ethan.caballero@mila.quebec); Kshitij Gupta, Mila, University of Montreal; Irina Rish, Mila, University of Montreal; David Krueger, University of Cambridge |
| Pseudocode | No | The paper does not include any pseudocode or algorithm blocks. It describes mathematical equations and experimental setups in text and tables. |
| Open Source Code | Yes | Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws |
| Open Datasets | Yes | Using the scaling laws benchmark of Alabdulmohsin et al. (2022), we evaluate how well various functional forms extrapolate performance on vision tasks as training dataset size increases. ... The downstream tasks are: Birds 200 (Welinder et al., 2010), Caltech101 (Fei-Fei et al., 2004), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions 'black points are points used for fitting a functional form, green points are the held-out points used for evaluating extrapolation of functional form fit to the black points'. While it distinguishes fitting points from held-out points, it does not specify a 'validation' dataset split with percentages or counts. |
| Hardware Specification | Yes | Each experiment was run on a single V100 GPU and each run took less than 2 hours. |
| Software Dependencies | No | The paper mentions the 'SciPy curve-fitting library (Virtanen et al., 2020)', 'matplotlib (Hunter, 2007)', and the 'minGPT implementation (Karpathy, 2020)'. While libraries are named, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For our experiments we train the transformer model using the following set of hyperparameters: D_model = 128; D_MLP = 512; number of heads = 2; number of transformer blocks (i.e. layers) = 1; learning rate = 0.0001; weight decay = 0.1; dropout probability = 0.0; dataset sizes = 144-1008; vocab size = 10 |
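The smoothly broken power law quoted in the table's first row can be sketched in plain Python. The functional form below follows the paper's definition, y = a + b·x^(-c0)·∏_i (1 + (x/d_i)^(1/f_i))^(-c_i·f_i); the single-break parameter values in the demo are illustrative toy numbers, not fitted constants from the paper.

```python
import math

def bnsl(x, a, b, c0, breaks):
    """Smoothly broken power law (BNSL) with n breaks.

    y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    a      : limiting value of the metric as x -> infinity
    b, c0  : scale and log-log slope of the initial power-law segment
    breaks : list of (c_i, d_i, f_i) tuples; break i shifts the log-log
             slope by an additional -c_i around x = d_i, with the
             sharpness of the transition controlled by f_i
    """
    term = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        term *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + term

def local_slope(f, x, eps=1e-4):
    """Numerical log-log slope of f at x via a finite difference."""
    return (math.log(f(x * (1.0 + eps))) - math.log(f(x))) / math.log1p(eps)

# Toy single-break curve: slope -0.5 well before the break at x = 1000,
# bending to slope -1.0 well after it.
curve = lambda x: bnsl(x, a=0.0, b=1.0, c0=0.5, breaks=[(0.5, 1000.0, 0.05)])
print(local_slope(curve, 10.0))  # ≈ -0.5 (pre-break regime)
print(local_slope(curve, 1e6))   # ≈ -1.0 (post-break regime)
```

In practice the paper fits these parameters with SciPy's curve-fitting routines; the pure-Python version here only illustrates how each break smoothly changes the local power-law exponent.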