Towards understanding how momentum improves generalization in deep learning

Authors: Samy Jelassi, Yuanzhi Li

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. In Section 2, we empirically confirm that momentum consistently improves generalization when using different architectures on a wide range of batch sizes and datasets. We conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We used VGG-19 (Simonyan & Zisserman, 2014) and ResNet-18 (He et al., 2016) as architectures. (See the optimizer sketch after the table.)
Researcher Affiliation | Academia | Samy Jelassi (Princeton University, NJ, USA); Yuanzhi Li (Carnegie Mellon University, PA, USA).
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a direct link to a code repository.
Open Datasets | Yes | We train a VGG-19 (Simonyan & Zisserman, 2014) using SGD, SGD+M, gradient descent (GD) and GD with momentum (GD+M) on the CIFAR-10 image classification task. To evaluate the contribution of momentum to generalization, we conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009).
Dataset Splits | No | The paper specifies training and test sets ('The training dataset consists in 500 points in dimension 30 and test set in 5000 points.') but does not explicitly mention a validation set or its split. It states, 'The model is trained for 300 epochs to ensure zero training error,' implying training until convergence, but no validation split is detailed. (See the synthetic-split sketch after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only describes the software setup and experimental parameters.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions the use of VGG-19 and ResNet-18 architectures and optimizers (SGD, GD, momentum) but no software environment details such as Python, PyTorch/TensorFlow versions, or specific libraries with their corresponding versions.
Experiment Setup | Yes | In all of our experiments, we refer to the stochastic gradient descent optimizer with batch size 128 as SGD/SGD+M and the optimizer with full batch size as GD/GD+M. We turn off data augmentation and batch normalization to isolate the contribution of momentum to the optimization. Note that for each algorithm, we grid-search over stepsizes and the momentum parameter to find the best one in terms of test accuracy. We train the models for 300 epochs. The stepsize is decayed by a factor of 10 at epochs 190 and 265 during training. All the results are averaged over 5 seeds. Lastly, the momentum factor is set to γ = 1/polylog(d). We set the momentum parameter to 0.9. We apply a linear-decay learning-rate schedule during training. (See the training-schedule sketch after the table.)
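
The optimizer comparison described in the Research Type and Open Datasets rows (SGD, SGD+M, GD and GD+M on CIFAR-10/CIFAR-100 with VGG-19 or ResNet-18) maps onto standard PyTorch components. The following is a minimal sketch under the reported settings (batch size 128 for the stochastic variants, full batch for GD/GD+M, momentum 0.9, no data augmentation); the torchvision constructors and variable names are assumptions, not code released by the authors.

```python
# Minimal sketch (not the authors' code): configuring the four optimizers
# compared in the paper -- SGD, SGD+M, GD and GD+M -- on CIFAR-10.
import torch
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # data augmentation turned off, as reported
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                          download=True, transform=transform)

# Stochastic variants use batch size 128; the full-batch variants (GD/GD+M)
# put the whole training set into a single batch (memory-hungry, illustrative only).
loader_sgd = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
loader_gd = torch.utils.data.DataLoader(train_set, batch_size=len(train_set))

# VGG-19 without batch norm; the paper also reports ResNet-18, which would need
# its batch-norm layers removed to match the stated setup.
model = torchvision.models.vgg19(num_classes=10)

def make_optimizer(params, lr, use_momentum):
    # momentum=0.9 matches the reported momentum parameter;
    # momentum=0.0 recovers plain (S)GD.
    return torch.optim.SGD(params, lr=lr, momentum=0.9 if use_momentum else 0.0)
```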
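
For the synthetic experiment quoted in the Dataset Splits row (500 training points in dimension 30, 5000 test points, no validation split mentioned), a data-generation skeleton could look like the following. Only the sizes and the dimension come from the report; the Gaussian inputs and random binary labels are placeholders for the paper's actual data model.

```python
# Sketch of the reported split sizes only: 500 train / 5000 test points in R^30.
# The paper's data-generating distribution is not reproduced here; Gaussian
# inputs and random labels in {-1, +1} are placeholders.
import torch

d = 30                     # input dimension reported in the paper
n_train, n_test = 500, 5000

X_train = torch.randn(n_train, d)
y_train = torch.randint(0, 2, (n_train,)) * 2 - 1
X_test = torch.randn(n_test, d)
y_test = torch.randint(0, 2, (n_test,)) * 2 - 1

# No validation set: the report notes the model is trained for 300 epochs
# until zero training error, with evaluation on the test set only.
```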
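
The schedule in the Experiment Setup row (300 epochs, stepsize decayed by a factor of 10 at epochs 190 and 265, momentum 0.9, results averaged over 5 seeds) corresponds to a standard PyTorch loop with a MultiStepLR scheduler. This is a hedged sketch of that schedule, not the authors' code; `model`, the loaders and the grid-searched stepsize `lr` are assumed to come from the sketches above.

```python
# Sketch of the reported schedule: 300 epochs, LR divided by 10 at epochs 190
# and 265, momentum 0.9.  Repeated over 5 seeds and over a grid of stepsizes,
# as described in the Experiment Setup row.
import torch
import torch.nn.functional as F

def train_one_run(model, loader, lr, seed):
    torch.manual_seed(seed)  # results are averaged over 5 seeds
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[190, 265], gamma=0.1)
    for epoch in range(300):
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
        sched.step()

# Example usage: one run per seed for a given (optimizer, stepsize) configuration.
# for seed in range(5):
#     train_one_run(model, loader_sgd, lr=0.1, seed=seed)
```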