Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Laws from the Data Manifold Dimension

Authors: Utkarsh Sharma, Jared Kaplan

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of d and α by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
Researcher Affiliation | Academia | Utkarsh Sharma (EMAIL), Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA; Jared Kaplan (EMAIL), Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA
Pseudocode | No | The paper describes methods and experiments in detail but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for our experiments will be available at: https://github.com/U-Sharma/NeuralScaleID
Open Datasets | Yes | We also test the theory with CNN image classifiers on several datasets and with GPT-type language models... So we used a version of the default tutorial CNN in tensorflow Abadi et al. (2015), which we modified only by scaling the number of channels (i.e. the width). Figure 6 shows the scaling of the test loss with number of parameters N... We performed a very similar analysis on the MNIST LeCun and Cortes (2010), fashion MNIST Xiao et al. (2017), and SVHN Netzer et al. (2011) datasets using slightly smaller networks (see section A.4)... The GPT-type language models display power-law scaling of L(N) over at least five orders of magnitude in N, with exponent α ≈ 0.076 Kaplan et al. (2020).
Dataset Splits | No | For CIFAR10 we used the architecture from the tensorflow CNN tutorial Abadi et al. (2015), and modified the channel width... Figure 6 shows the scaling of the test loss with number of parameters N... The left figure shows the test and training loss L(N) for various sizes of CNN trained on CIFAR10... We performed a very similar analysis on the MNIST LeCun and Cortes (2010), fashion MNIST Xiao et al. (2017), and SVHN Netzer et al. (2011) datasets... For MNIST and fashion MNIST, we ran each network for 20 trials and took the mean loss (on log scale). The networks were trained for 50 epochs... For SVHN, the networks were trained for 5 epochs with both training and additional datasets used for training (total 604k images), and test dataset (26k images) for testing. While the paper discusses training and test losses for these datasets, it does not provide specific details on how the datasets were split, such as percentages, sample counts, or explicit references to predefined split methodologies.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | So we used a version of the default tutorial CNN in tensorflow Abadi et al. (2015)... We use the ADAM optimizer Kingma and Ba (2014) with default settings except for the learning rate. The paper mentions TensorFlow and the ADAM optimizer but does not specify their version numbers or any other software dependencies with version information.
Experiment Setup | Yes | We use the ADAM optimizer Kingma and Ba (2014) with default settings except for the learning rate. In order to optimize effectively, we scanned over a grid of learning rates, and experimented with cosine, linear, and step-function learning rate schedules. We ended up using step function schedules for teacher/student experiments, and a constant learning rate for CIFAR10 and other image datasets... Our learning rate schedules for the various teacher/student experiments in the paper (labeled by associated figures) are summarized in table 1.

Experiment | student (T/S) architecture | training steps | batch size | learning rate (ADAM)
(random) figures 7, 11, 5 | MSE: [20,n,n,1]; CE: [20,n,n,2] | 0-200k | 200 | 0.01
 | | 200-220k | 1000 | 0.01
 | | 220-240k | 4000 | 0.001
(vetted) figure 14 | [9,n,n,2] | 0-100k | 200 | 0.01
 | | 100-150k | 200 | 0.001
 | | 150-170k | 200 | 0.0001

Table 1: Architectures and training schedules for Teacher/Student experiments in the paper, referenced by the figures in which the results are described.

For CIFAR10 we used the architecture from the tensorflow CNN tutorial Abadi et al. (2015)... The networks were trained for 50 epochs with the ADAM optimizer with default hyperparameters.
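The theory quoted under Research Type ties the scaling exponent α to the intrinsic dimension d of the data manifold, which the paper measures with nearest-neighbor methods. As a rough illustration only, not the authors' exact pipeline, a TwoNN-style maximum-likelihood estimator recovers d from the ratio of each point's second- to first-nearest-neighbor distance; the function name and the synthetic 2-D manifold below are invented for this sketch:

```python
import numpy as np

def two_nn_dimension(X):
    """TwoNN-style intrinsic-dimension estimate: d = n / sum(log(r2 / r1)),
    where r1, r2 are each point's 1st and 2nd nearest-neighbor distances."""
    # squared pairwise distances via the dot-product identity
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)      # exclude self-distances
    d2.sort(axis=1)
    mu = np.sqrt(d2[:, 1] / d2[:, 0])  # ratio r2 / r1 >= 1
    return len(mu) / np.sum(np.log(mu))

# 1000 points sampled uniformly from a 2-D plane embedded in 10-D space
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2)) @ rng.normal(size=(2, 10))
print(two_nn_dimension(X))  # typically close to the true dimension, 2
```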
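The power-law scaling quoted under Open Datasets, L(N) ∝ N^(-α) with α ≈ 0.076 for the GPT-type models, is the kind of relation one fits as a straight line in log-log space. A minimal sketch on synthetic losses; the amplitude 2.5 and the grid of N are arbitrary choices, not values from the paper:

```python
import numpy as np

# synthetic test losses obeying L(N) = C * N**(-alpha) exactly
alpha_true = 0.076
N = np.logspace(5, 10, num=12)          # parameter counts, 1e5 .. 1e10
L = 2.5 * N ** (-alpha_true)

# log L = log C - alpha * log N, so the slope of a linear fit is -alpha
slope, _ = np.polyfit(np.log(N), np.log(L), deg=1)
print(f"fitted alpha = {-slope:.3f}")   # fitted alpha = 0.076
```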
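The step-function schedules in the Experiment Setup row (Table 1) amount to a piecewise-constant lookup from training step to (batch size, learning rate). A hypothetical helper encoding the "(random)" teacher/student row, written for illustration rather than taken from the paper's code:

```python
def random_ts_schedule(step):
    """(batch size, learning rate) for the '(random)' teacher/student
    schedule of Table 1: 0-200k -> (200, 0.01),
    200-220k -> (1000, 0.01), 220-240k -> (4000, 0.001)."""
    if step < 200_000:
        return 200, 0.01
    if step < 220_000:
        return 1000, 0.01
    return 4000, 0.001

print(random_ts_schedule(150_000))  # (200, 0.01)
print(random_ts_schedule(225_000))  # (4000, 0.001)
```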