Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Emergence and scaling laws in SGD learning of shallow neural networks

Authors: Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and identify sharp transition times to recover each signal direction. Figure 2: Theoretical and empirical risk curves with $\beta = 0.8$. (a) Idealized scaling curves described in Section 3.1. (b) Empirical scaling curve of GD training on the population loss with $d = 2048$, $P = 1024$.
Researcher Affiliation | Academia | (1) Princeton University, (2) New York University, (3) Flatiron Institute. EMAIL, EMAIL
Pseudocode | No | The paper describes the algorithms and methods using mathematical notation and prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: This is a theory paper; toy experiments are conducted on Gaussian data.
Open Datasets | No | We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f(x) = \sum_{p=1}^{P} a_p \sigma(\langle x, v_p \rangle)$, $x \sim \mathcal{N}(0, I_d)$. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: This is a theory paper; toy experiments are conducted on Gaussian data.
Dataset Splits | No | We set $d = 2048$, $P = 1024$, $\sigma = h_4$, and vary the student width. This section describes the parameters used for generating synthetic data and for simulations, but it does not specify any training, validation, or test dataset splits.
Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA]. Justification: This is a theory paper; toy experiments are conducted on Gaussian data.
Software Dependencies | No | Question: Does the paper provide SPECIFIC ANCILLARY SOFTWARE DETAILS (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment? Answer: [No]. Justification: This is a theory paper; toy experiments are conducted on Gaussian data.
Experiment Setup | Yes | We use online stochastic gradient descent (SGD) to train the learner model. Let $\eta > 0$ be the step size. At each step, we update the neurons using vanilla gradient descent: $v_k(t+1) = v_k(t) - \eta \nabla_{v_k} l(x_t)$ for all $k \in [m]$, where $l$ is the per-sample loss defined in (3). We initialize the student neurons $v_k \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sigma_0))$, where $\sigma_0 = 1/\mathrm{poly}(d)$ is a parameter we specify in the sequel. In Figure 2, we plot... the MSE loss curves for GD training (with fixed step size) on the population loss, where we set $d = 2048$, $P = 1024$, $\sigma = h_4$, and vary the student width.
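
The quoted setup (fresh Gaussian sample per step, a vanilla gradient step on each student neuron, small-radius spherical initialization) can be illustrated with a short NumPy sketch. It is not the authors' code, and several pieces are assumptions made only for illustration: the per-sample loss (3) is taken to be half the squared error, "h4" is read as the normalized fourth probabilists' Hermite polynomial, the teacher directions are standard basis vectors, both second layers are fixed uniform weights, and the sizes are far smaller than the reported d = 2048, P = 1024. The function names (h4, sample_on_sphere, online_sgd) are hypothetical.

```python
import numpy as np


def h4(z):
    # ASSUMPTION: "h4" read as the normalized 4th probabilists' Hermite polynomial.
    return (z**4 - 6.0 * z**2 + 3.0) / np.sqrt(24.0)


def h4_prime(z):
    # Derivative of h4 with respect to its argument.
    return (4.0 * z**3 - 12.0 * z) / np.sqrt(24.0)


def sample_on_sphere(rng, n, d, radius):
    # Uniform samples on the sphere of the given radius in R^d,
    # obtained by normalizing standard Gaussian vectors.
    g = rng.standard_normal((n, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)


def online_sgd(d=32, P=4, m=8, eta=1e-3, sigma0=1e-2, steps=200, seed=0):
    rng = np.random.default_rng(seed)

    # ASSUMPTION: teacher directions are the first P standard basis vectors,
    # with uniform second-layer weights.
    V_star = np.eye(d)[:P]
    a = np.ones(P) / P

    # Student neurons initialized uniformly on the sphere of radius sigma0.
    V = sample_on_sphere(rng, m, d, sigma0)
    # ASSUMPTION: fixed, uniform student second layer.
    b = np.ones(m) / m

    losses = np.empty(steps)
    for t in range(steps):
        x = rng.standard_normal(d)           # fresh sample x_t ~ N(0, I_d)
        y = a @ h4(V_star @ x)               # teacher output
        pre = V @ x                          # student pre-activations <v_k, x_t>
        resid = b @ h4(pre) - y              # student output minus target
        losses[t] = 0.5 * resid**2           # ASSUMED per-sample loss
        # Gradient of the per-sample loss with respect to each v_k.
        grad = resid * (b * h4_prime(pre))[:, None] * x[None, :]
        V = V - eta * grad                   # v_k(t+1) = v_k(t) - eta * grad_k
    return V, losses


if __name__ == "__main__":
    V, losses = online_sgd()
    print(V.shape, float(losses.mean()))
```

Because each step draws a new sample, the loop never revisits data, which is why the scorecard reports no dataset splits for this setup.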