Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Optimization Landscape of SGD Across the Feature Learning Strength

Authors: Alexander Atanasov, Alexandru Meterez, James Simon, Cengiz Pehlevan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model.
Researcher Affiliation Academia 1 Department of Physics, Harvard University; 2 School of Engineering and Applied Science, Harvard University; 3 Department of Physics, University of California, Berkeley; 4 Center for Brain Science, Harvard University; 5 Kempner Institute, Harvard University; Imbue
Pseudocode No The paper describes mathematical models and dynamics (e.g., in Section 4, "A SIMPLE MODEL EXPLAINING OBSERVED SCALINGS") and theoretical derivations of update rules (e.g., "The update rule for Sign SGD is θ ↦ θ − η sign(∇θL)"), but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes CODE AVAILABILITY The code to reproduce all figures in this paper can be accessed at: https://github.com/Pehlevan-Group/Richness Sweep.
Open Datasets Yes We study networks trained on the datasets MNIST, CIFAR, and Tiny ImageNet. Our motivation is to study networks training in the online setting over several orders of magnitude in time. To this end, we adopt larger versions of these datasets: MNIST-1M and CIFAR-5M, and apply strong data augmentation to Tiny ImageNet. We generate MNIST-1M using the denoising diffusion model (Ho et al., 2020) implementation of Pearce (2022). We use CIFAR-5M from Nakkiran et al. (2021).
Dataset Splits Yes We train online on batches B_t = {x_µ, y_µ}_{µ=1}^B with batch size B for t ∈ {1, …, T} time steps. The corresponding test set for this dataset is the original MNIST test set. The corresponding test set for this dataset is the original CIFAR-10 test set.
Hardware Specification Yes We evaluate all our experiments on A100 and H100 GPUs, using PyTorch.
Software Dependencies No The paper mentions "PyTorch" and "PyHessian (Yao et al., 2020)" but does not provide specific version numbers for any software components, which is required for reproducibility.
Experiment Setup Yes We sweep jointly over every pair of (η, γ) in a log-spaced grid running from γ = 10⁻⁵ to 10⁵. For each γ, we sweep η downwards from 10¹² to 10⁻¹² in a log-spaced grid until the first convergent η is reached. We train online on batches B_t = {x_µ, y_µ}_{µ=1}^B with batch size B for t ∈ {1, …, T} time steps. In all cases, we created a Centered Model class, ensuring that the output of the network is zero at initialization by making the following definition for the trained function f: γ[f(x, θ) − f(x, θ₀)]. All models are still trained with SGD.
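The Sign SGD update rule quoted in the Pseudocode row, θ ↦ θ − η sign(∇θL), is a one-line operation. A minimal numpy sketch of a single update on a toy quadratic loss (the function name `sign_sgd_step` and the toy loss are illustrative, not from the paper's code):

```python
import numpy as np

def sign_sgd_step(theta, grad, eta):
    """One Sign SGD update: theta <- theta - eta * sign(grad of L w.r.t. theta)."""
    return theta - eta * np.sign(grad)

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -0.5, 0.0])
theta_new = sign_sgd_step(theta, grad=theta, eta=0.1)
# Each nonzero coordinate moves by exactly eta toward zero; sign(0) = 0 leaves
# the last coordinate fixed, so theta_new = [0.9, -0.4, 0.0].
```

Unlike plain SGD, the step size per coordinate is fixed at η regardless of gradient magnitude, which is why the paper can treat its scaling behavior separately.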
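The Experiment Setup row describes two concrete pieces: a log-spaced (η, γ) grid and a "Centered Model" wrapper whose trained function is γ[f(x, θ) − f(x, θ₀)], which is zero at initialization. A minimal numpy sketch of both, using a toy linear model in place of a network (all names here are illustrative and not taken from the released repository):

```python
import numpy as np

def centered_model(f, theta, theta0, gamma):
    """Gamma-scaled centered function: x -> gamma * (f(x, theta) - f(x, theta0))."""
    def f_centered(x):
        return gamma * (f(x, theta) - f(x, theta0))
    return f_centered

# Toy linear "network" f(x, theta) = theta . x, standing in for a real model.
f = lambda x, theta: float(theta @ x)

theta0 = np.array([1.0, -2.0])  # parameters at initialization
theta = np.array([1.5, -1.0])   # parameters after some training
x = np.array([2.0, 1.0])

g = centered_model(f, theta, theta0, gamma=4.0)
g_init = centered_model(f, theta0, theta0, gamma=4.0)
# g_init(x) is exactly 0.0: the Centered Model's output vanishes at initialization.

# Log-spaced gamma grid as in the sweep, spanning 10^-5 to 10^5.
gammas = np.logspace(-5, 5, num=11)
```

The centering guarantees that at t = 0 the trained function carries no signal from the random initialization, so γ purely rescales the learned update f(x, θ) − f(x, θ₀).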