Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Optimization Landscape of SGD Across the Feature Learning Strength
Authors: Alexander Atanasov, Alexandru Meterez, James Simon, Cengiz Pehlevan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. |
| Researcher Affiliation | Academia | ¹Department of Physics, Harvard University; ²School of Engineering and Applied Science, Harvard University; ³Department of Physics, University of California, Berkeley; ⁴Center for Brain Science, Harvard University; ⁵Kempner Institute, Harvard University; Imbue |
| Pseudocode | No | The paper describes mathematical models and dynamics (e.g., in Section 4, "A SIMPLE MODEL EXPLAINING OBSERVED SCALINGS") and theoretical derivations of update rules (e.g., "The update rule for Sign SGD is θ ↦ θ − η·sign(∇θL)"), but it does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | CODE AVAILABILITY The code to reproduce all figures in this paper can be accessed at: https://github.com/Pehlevan-Group/Richness Sweep. |
| Open Datasets | Yes | We study networks trained on the datasets MNIST, CIFAR, and Tiny ImageNet. Our motivation is to study networks training in the online setting over several orders of magnitude in time. To this end, we adopt larger versions of these datasets: MNIST-1M and CIFAR-5M, and apply strong data augmentation to Tiny ImageNet. We generate MNIST-1M using the denoising diffusion model (Ho et al., 2020) in Pearce (2022). We use CIFAR-5M from Nakkiran et al. (2021). |
| Dataset Splits | Yes | We train online on batches Bₜ = {xµ, yµ}µ=1..B with batch size B for t ∈ {1, …, T} time steps. For MNIST-1M, the corresponding test set is the original MNIST test set; for CIFAR-5M, it is the original CIFAR-10 test set. |
| Hardware Specification | Yes | We evaluate all our experiments on A100 and H100 GPUs, using PyTorch. |
| Software Dependencies | No | The paper mentions "PyTorch" and "PyHessian (Yao et al., 2020)" but does not provide specific version numbers for any software components, which is required for reproducibility. |
| Experiment Setup | Yes | We sweep jointly over every pair (η, γ) in a log-spaced grid running from γ = 10⁻⁵ to 10⁵. For each γ, we sweep downward from η = 10¹² to η = 10⁻¹² in a log-spaced grid until the first convergent η is reached. We train online on batches Bₜ = {xµ, yµ}µ=1..B with batch size B for t ∈ {1, …, T} time steps. In all cases, we created a CenteredModel class, ensuring that the output of the network is zero at initialization by defining the trained function as γ[f(x, θ) − f(x, θ₀)]. All models are still trained with SGD. |
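The sign-SGD update rule quoted in the Pseudocode row, θ ↦ θ − η·sign(∇θL), can be sketched in a few lines of Python. This is a minimal illustration with a toy quadratic loss; the name `sign_sgd_step` and the example values are chosen here, not taken from the paper or its code:

```python
import numpy as np

def sign_sgd_step(theta, grad, eta):
    # Sign SGD: theta <- theta - eta * sign(grad of L w.r.t. theta).
    # Only the sign of each gradient coordinate matters, not its magnitude.
    return theta - eta * np.sign(grad)

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([3.0, -2.0, 0.5])
eta = 0.1
theta = sign_sgd_step(theta, theta, eta)
# Every coordinate moves by exactly eta toward zero: [2.9, -1.9, 0.4]
```

Because each coordinate moves by exactly η per step regardless of gradient scale, sign SGD behaves differently from plain SGD under the γ-rescaling studied in the paper.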
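The joint (η, γ) sweep described in the Experiment Setup row amounts to two log-spaced grids, with η scanned downward from its largest value until training first converges. A hedged sketch follows; the grid resolutions, the `first_convergent_eta` helper, and the `converges` callback are illustrative assumptions, not taken from the repository:

```python
import numpy as np

# Log-spaced grids matching the quoted sweep endpoints (resolution assumed).
gammas = np.logspace(-5, 5, num=11)   # gamma from 1e-5 to 1e5
etas = np.logspace(12, -12, num=25)   # eta scanned downward from 1e12 to 1e-12

def first_convergent_eta(gamma, converges):
    # Scan eta from largest to smallest; return the first value at which
    # training converges for this gamma (None if none converges).
    for eta in etas:
        if converges(gamma, eta):
            return eta
    return None
```

For example, with a toy criterion `converges = lambda g, e: e <= 1.0`, the scan stops at η = 1, the largest grid value satisfying the criterion.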
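The CenteredModel construction from the Experiment Setup row subtracts the network's output at initialization and rescales by γ, so the trained function γ[f(x, θ) − f(x, θ₀)] is exactly zero at initialization. A minimal NumPy sketch, with a linear map standing in for the real network (the names and shapes here are placeholders, not the authors' implementation):

```python
import numpy as np

def f(x, theta):
    # Stand-in for the network forward pass: a simple linear map.
    return x @ theta

rng = np.random.default_rng(0)
theta0 = rng.normal(size=(4, 2))  # parameters frozen at initialization
gamma = 10.0                      # feature-learning strength

def centered_f(x, theta):
    # gamma * [f(x, theta) - f(x, theta0)]: exactly zero when theta == theta0.
    return gamma * (f(x, theta) - f(x, theta0))

x = rng.normal(size=(3, 4))
print(np.allclose(centered_f(x, theta0), 0.0))  # True: zero output at init
```

Centering removes the random function the network computes at initialization, so γ cleanly controls how large the learned update to the function is.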