Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Laws from the Data Manifold Dimension

Authors: Utkarsh Sharma, Jared Kaplan

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of d and α by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
Researcher Affiliation | Academia | Utkarsh Sharma (EMAIL), Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA; Jared Kaplan (EMAIL), Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA
Pseudocode | No | The paper describes methods and experiments in detail but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for our experiments will be available at: https://github.com/U-Sharma/NeuralScaleID
Open Datasets | Yes | We also test the theory with CNN image classifiers on several datasets and with GPT-type language models... So we used a version of the default tutorial CNN in tensorflow Abadi et al. (2015), which we modified only by scaling the number of channels (i.e. the width). Figure 6 shows the scaling of the test loss with number of parameters N... We performed a very similar analysis on the MNIST LeCun and Cortes (2010), fashion MNIST Xiao et al. (2017), and SVHN Netzer et al. (2011) datasets using slightly smaller networks (see section A.4)... The GPT-type language models display power-law scaling of L(N) over at least five orders of magnitude in N, with exponent α ≈ 0.076 Kaplan et al. (2020).
Dataset Splits | No | For CIFAR10 we used the architecture from the tensorflow CNN tutorial Abadi et al. (2015), and modified the channel width... Figure 6 shows the scaling of the test loss with number of parameters N... The left figure shows the test and training loss L(N) for various sizes of CNN trained on CIFAR10... We performed a very similar analysis on the MNIST LeCun and Cortes (2010), fashion MNIST Xiao et al. (2017), and SVHN Netzer et al. (2011) datasets... For MNIST and fashion MNIST, we ran each network for 20 trials and took the mean loss (on log scale). The networks were trained for 50 epochs... For SVHN, the networks were trained for 5 epochs with both training and additional datasets used for training (total 604k images), and test dataset (26k images) for testing. While the paper discusses training and test losses for these datasets, it does not provide specific details on how the datasets were split, such as percentages, sample counts, or explicit references to predefined split methodologies.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | So we used a version of the default tutorial CNN in tensorflow Abadi et al. (2015)... We use the ADAM optimizer Kingma and Ba (2014) with default settings except for the learning rate. The paper mentions TensorFlow and the ADAM optimizer but does not specify their version numbers or any other software dependencies with version information.
Experiment Setup | Yes | We use the ADAM optimizer Kingma and Ba (2014) with default settings except for the learning rate. In order to optimize effectively, we scanned over a grid of learning rates, and experimented with cosine, linear, and step-function learning rate schedules. We ended up using step function schedules for teacher/student experiments, and a constant learning rate for CIFAR10 and other image datasets... Our learning rate schedules for the various teacher/student experiments in the paper (labeled by associated figures) are summarized in table 1.

Experiment | student (T/S) architecture | training steps | batch size | learning rate (ADAM)
(random) figures 7, 11, 5 | MSE: [20,n,n,1]; CE: [20,n,n,2] | 0-200k | 200 | 0.01
 | | 200-220k | 1000 | 0.01
 | | 220-240k | 4000 | 0.001
(vetted) figure 14 | [9,n,n,2] | 0-100k | 200 | 0.01
 | | 100-150k | 200 | 0.001
 | | 150-170k | 200 | 0.0001

Table 1: Architectures and training schedules for Teacher/Student experiments in the paper, referenced by the figures in which the results are described.

For CIFAR10 we used the architecture from the tensorflow CNN tutorial Abadi et al. (2015)... The networks were trained for 50 epochs with the ADAM optimizer with default hyperparameters.
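The theory quoted under Research Type ties the scaling exponent α to the intrinsic dimension d of the data manifold, which the paper measures with nearest-neighbor methods. As a rough illustration only, not the authors' exact pipeline, a TwoNN-style maximum-likelihood estimator recovers d from the ratio of each point's second- to first-nearest-neighbor distance; the function name and the synthetic 2-D manifold below are invented for this sketch:

```python
import numpy as np

def two_nn_dimension(X):
    """TwoNN-style intrinsic-dimension estimate: d = n / sum(log(r2 / r1)),
    where r1, r2 are each point's 1st and 2nd nearest-neighbor distances."""
    # squared pairwise distances via the dot-product identity
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)      # exclude self-distances
    d2.sort(axis=1)
    mu = np.sqrt(d2[:, 1] / d2[:, 0])  # ratio r2 / r1 >= 1
    return len(mu) / np.sum(np.log(mu))

# 1000 points sampled uniformly from a 2-D plane embedded in 10-D space
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2)) @ rng.normal(size=(2, 10))
print(two_nn_dimension(X))  # typically close to the true dimension, 2
```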
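The power-law scaling quoted under Open Datasets, L(N) ∝ N^(-α) with α ≈ 0.076 for the GPT-type models, is the kind of relation one fits as a straight line in log-log space. A minimal sketch on synthetic losses; the amplitude 2.5 and the grid of N are arbitrary choices, not values from the paper:

```python
import numpy as np

# synthetic test losses obeying L(N) = C * N**(-alpha) exactly
alpha_true = 0.076
N = np.logspace(5, 10, num=12)          # parameter counts, 1e5 .. 1e10
L = 2.5 * N ** (-alpha_true)

# log L = log C - alpha * log N, so the slope of a linear fit is -alpha
slope, _ = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"fitted alpha = {-slope:.3f}")   # fitted alpha = 0.076
```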
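The step-function schedules in the Experiment Setup row (Table 1) amount to a piecewise-constant lookup from training step to (batch size, learning rate). A hypothetical helper encoding the "(random)" teacher/student row, written for illustration rather than taken from the paper's code:

```python
def random_ts_schedule(step):
    """(batch size, learning rate) for the '(random)' teacher/student
    schedule of Table 1: 0-200k -> (200, 0.01),
    200-220k -> (1000, 0.01), 220-240k -> (4000, 0.001)."""
    if step < 200_000:
        return 200, 0.01
    if step < 220_000:
        return 1000, 0.01
    return 4000, 0.001

print(random_ts_schedule(150_000))  # (200, 0.01)
print(random_ts_schedule(225_000))  # (4000, 0.001)
```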