Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Authors: Xuan Tang, Han Zhang, Yuan Cao, Difan Zou

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance. D.1 Experimental Details for Real-world Data, D.2 Experimental Details for Synthetic Data |
| Researcher Affiliation | Academia | (1) School of Computing and Data Science, The University of Hong Kong; (2) Institute of Data Science, The University of Hong Kong. EMAIL, EMAIL |
| Pseudocode | Yes | Furthermore, motivated by the behavioral similarity between Adam and SignGD when the learning rate is sufficiently small or β1, β2 approach zero (Balles and Hennig, 2018; Bernstein et al., 2018), we present results for SignSGD. We subsequently extend these results to stochastic Adam, which provided in Appendix C. The update rules for SignSGD are given as follows: (SignSGD) $w_{j,r}^{(t+1)} = w_{j,r}^{(t)} - \eta \, \mathrm{sgn}(g_{t,j,r}^{(t)})$, (5.1) |
| Open Source Code | No | The justification for question 5 in the NeurIPS checklist states: "We use VGG16, ResNet18, ResNet50 models and the CIFAR-10, ImageNet1K datasets, which are easily available on the Internet. All the experimental details are provided in Appendix D." This statement refers to the availability of models and datasets, not the authors' implementation code for the described methodology. |
| Open Datasets | Yes | We use VGG16, ResNet18, ResNet50 models and the CIFAR-10, ImageNet1K datasets, which are easily available on the Internet. |
| Dataset Splits | Yes | For the real-world experiments in Figures 1 and 2, we use the CIFAR-10 dataset... Large-scale vision experiments with ResNet-50 on ImageNet-1K subset, Figures 11 and 12. To further validate our theory, we conduct large-scale experiments on ImageNet-1K. We construct a subset by randomly sampling 100 training images per class (seed=0), ensuring a controlled large-batch regime (n/B = Θ(1)) while keeping computation feasible. |
| Hardware Specification | Yes | All experiments can be run within one hour on a single RTX 4090 GPU. The only exception is training ResNet18 with a batch size of 8192, which requires three GPUs due to memory constraints. |
| Software Dependencies | No | For the real-world experiments in Figures 1 and 2, we use the CIFAR-10 dataset, VGG16 and ResNet18 architectures, and the Adam and AdamW optimizers, all implemented in PyTorch. (No version specified for PyTorch) |
| Experiment Setup | Yes | In Figure 1, we report the test error as a function of batch size. The batch sizes considered are {16, 32, 64, 256, 1024, 4096, 8192}, with training conducted for 100 epochs. The weight decay is set to 5e-4 for Adam and 1e-2 for AdamW; the momentum parameters are fixed at (β1, β2) = (0.9, 0.99) for both optimizers. Each configuration is evaluated with three learning rates: {5e-4, 1e-4, 1e-5}, and we report the best test performance for each batch size. All synthetic experiments are trained for T = 10^4 epochs with a learning rate of η = 5e-5, and evaluated on a test dataset of size 10^4. For Adam and AdamW optimizers, we adopt the default momentum hyperparameters β1 = 0.9 and β2 = 0.999. |
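
The sign-based update quoted in the Pseudocode row (Eq. 5.1) moves each weight by a fixed step of size η against the sign of its stochastic gradient. The following is a minimal sketch of a single such step on a flat list of weights, not the authors' implementation. The names `sign` and `sign_sgd_step` are hypothetical, the convention sgn(0) = 0 is an assumption, and the gradients are passed in directly rather than computed from a mini-batch.

```python
from typing import List


def sign(x: float) -> float:
    """Return +1.0, -1.0, or 0.0 (assumed convention: sgn(0) = 0)."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def sign_sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """One sign-based step: w <- w - lr * sgn(g), applied per coordinate.

    `grads` stands in for the stochastic (mini-batch) gradient; how it is
    computed is outside the scope of this sketch.
    """
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    return [w - lr * sign(g) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    w = [1.0, -2.0, 0.5]
    g = [0.3, -4.0, 0.0]
    print(sign_sgd_step(w, g, lr=0.1))
```

Every coordinate with a nonzero gradient moves by exactly `lr`, regardless of the gradient's magnitude. This is the property that links the update to Adam when the learning rate is small or β1, β2 approach zero, as the quoted passage notes.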