Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards understanding how momentum improves generalization in deep learning
Authors: Samy Jelassi, Yuanzhi Li
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. In Section 2, we empirically confirm that momentum consistently improves generalization when using different architectures on a wide range of batch sizes and datasets. We conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We used VGG-19 (Simonyan & Zisserman, 2014) and Resnet-18 (He et al., 2016) as architectures. |
| Researcher Affiliation | Academia | Samy Jelassi¹, Yuanzhi Li². ¹Princeton University, NJ, USA. ²Carnegie Mellon University, PA, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We train a VGG-19 (Simonyan & Zisserman, 2014) using SGD, SGD+M, gradient descent (GD) and GD with momentum (GD+M) on the CIFAR-10 image classification task. To evaluate the contribution of momentum to generalization, we conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper specifies training and test sets ('The training dataset consists in 500 points in dimension 30 and test set in 5000 points.') but does not explicitly mention a validation set or its split. It states, 'The model is trained for 300 epochs to ensure zero training error,' implying training until convergence, but no validation split is detailed. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only describes the software setup and experimental parameters. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions the use of VGG-19 and ResNet-18 architectures and optimizers (SGD, GD, momentum) but no software environment details like Python, PyTorch/TensorFlow versions, or specific libraries with their corresponding versions. |
| Experiment Setup | Yes | In all of our experiments, we refer to the stochastic gradient descent optimizer with batch size 128 as SGD/SGD+M and the optimizer with full batch size as GD/GD+M. We turn off data augmentation and batch normalization to isolate the contribution of momentum to the optimization. Note that for each algorithm, we grid-search over stepsizes and momentum parameter to find the best one in terms of test accuracy. We train the models for 300 epochs. The stepsize is decayed by a factor 10 at epochs 190 and 265 during training. All the results are averaged over 5 seeds. Lastly, the momentum factor is set to γ = 1 polylog(d) . We set the momentum parameter to 0.9. We apply a linear decay learning rate scheduling during training. |
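
The step-decay schedule and momentum setting quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' code, and no code release is reported for the paper. The function names `stepsize_at` and `momentum_step`, the base stepsize of 0.1, the 0-indexed epoch convention, and the PyTorch-style momentum update (`v = momentum * v + grad`) are all assumptions made for illustration. The row also mentions a linear decay schedule, presumably for a different experiment, which this sketch does not cover.

```python
def stepsize_at(epoch, base_lr, milestones=(190, 265), factor=10.0):
    """Stepsize in effect at a given epoch under step decay.

    Assumes epochs are 0-indexed and that the decay applies from the
    milestone epoch onward; the source only says "at epochs 190 and 265".
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


def momentum_step(w, v, grad, lr, momentum=0.9):
    """One momentum update on plain Python lists.

    Uses the PyTorch-style convention v <- momentum * v + grad, which is an
    assumption here; the paper's exact parametrization may differ.
    """
    v_new = [momentum * vi + gi for vi, gi in zip(v, grad)]
    w_new = [wi - lr * vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


if __name__ == "__main__":
    base_lr = 0.1  # hypothetical value; the paper grid-searches the stepsize
    for epoch in (0, 189, 190, 264, 265, 299):
        print(epoch, stepsize_at(epoch, base_lr))
```

Setting `momentum=0.0` in `momentum_step` recovers plain gradient descent, which is the GD versus GD+M comparison the row describes. Feeding it full-batch or batch-size-128 gradients corresponds to the GD/GD+M and SGD/SGD+M settings respectively.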