Towards understanding how momentum improves generalization in deep learning
Authors: Samy Jelassi, Yuanzhi Li
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. In Section 2, we empirically confirm that momentum consistently improves generalization when using different architectures on a wide range of batch sizes and datasets. We conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We used VGG-19 (Simonyan & Zisserman, 2014) and Resnet-18 (He et al., 2016) as architectures. |
| Researcher Affiliation | Academia | Samy Jelassi 1 Yuanzhi Li 2 1Princeton University, NJ, USA. 2Carnegie Mellon University, PA, USA. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We train a VGG-19 (Simonyan & Zisserman, 2014) using SGD, SGD+M, gradient descent (GD) and GD with momentum (GD+M) on the CIFAR-10 image classification task. To evaluate the contribution of momentum to generalization, we conducted extensive experiments on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper specifies training and test sets ('The training dataset consists in 500 points in dimension 30 and test set in 5000 points.') but does not explicitly mention a validation set or its split. It states, 'The model is trained for 300 epochs to ensure zero training error,' implying training until convergence, but no validation split is detailed. A placeholder sketch mirroring only these split sizes appears below the table. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. It only describes the software setup and experimental parameters. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions the use of VGG-19 and ResNet-18 architectures and optimizers (SGD, GD, momentum) but no software environment details like Python, PyTorch/TensorFlow versions, or specific libraries with their corresponding versions. |
| Experiment Setup | Yes | In all of our experiments, we refer to the stochastic gradient descent optimizer with batch size 128 as SGD/SGD+M and the optimizer with full batch size as GD/GD+M. We turn off data augmentation and batch normalization to isolate the contribution of momentum to the optimization. Note that for each algorithm, we grid-search over stepsizes and momentum parameter to find the best one in terms of test accuracy. We train the models for 300 epochs. The stepsize is decayed by a factor of 10 at epochs 190 and 265 during training. All the results are averaged over 5 seeds. Lastly, the momentum factor is set to γ = 1/polylog(d). We set the momentum parameter to 0.9. We apply linear learning-rate decay scheduling during training. A hedged sketch of this configuration appears below the table. |
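
For context on the optimizer labels used throughout the table, the GD/GD+M and SGD/SGD+M distinction can be summarized by the standard heavy-ball momentum update shown below. This is the common PyTorch-style convention, with momentum factor γ (0.9 in the quoted CIFAR runs) and stepsize η; it is offered as background rather than as a transcription of the paper's own parametrization, which is not quoted in the table.

```latex
% Standard heavy-ball momentum update (gamma = 0 recovers plain GD/SGD).
\[
  g^{(t)} = \gamma\, g^{(t-1)} + \nabla \widehat{L}\!\big(w^{(t)}\big),
  \qquad
  w^{(t+1)} = w^{(t)} - \eta\, g^{(t)}.
\]
```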
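
The CIFAR protocol quoted in the Open Datasets and Experiment Setup rows (VGG-19, batch size 128 for SGD/SGD+M, momentum 0.9, 300 epochs, stepsize decayed by a factor of 10 at epochs 190 and 265, data augmentation and batch normalization turned off) can be reconstructed roughly as follows. This is a minimal PyTorch sketch, not the authors' code: torchvision's plain `vgg19` stands in for their VGG-19, the base stepsize `lr=0.05` is a placeholder for their grid search, and the normalization constants are assumptions.

```python
# Hedged sketch of the quoted CIFAR-10 setup. Placeholders: base stepsize
# (grid-searched in the paper) and normalization constants.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Data augmentation is turned off: tensor conversion and normalization only.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=transform)

# Batch size 128 gives SGD/SGD+M; for GD/GD+M the batch size would be len(train_set).
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=512)

# torchvision's vgg19 (no batch normalization), reconfigured for 10 classes.
model = models.vgg19(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()

# momentum=0.9 gives SGD+M; momentum=0.0 gives plain SGD.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
# Stepsize decayed by a factor of 10 at epochs 190 and 265, over 300 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[190, 265], gamma=0.1)

for epoch in range(300):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```

Repeating this run over 5 seeds and grid-searching the stepsize and momentum factor would match the averaging and selection protocol quoted above; test accuracy computed on `test_loader` is the quantity the paper compares across optimizers.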
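
The Dataset Splits row quotes a setting with 500 training points in dimension 30 and 5,000 test points, and no validation split. The snippet below only mirrors those sizes: the Gaussian inputs and random ±1 labels come from a hypothetical `make_data` helper and are not the paper's data-generating process, which is not quoted in the table.

```python
# Mirrors only the quoted split sizes (500 train / 5,000 test, dimension 30).
# Inputs and labels are placeholders, not the paper's data model.
import numpy as np

def make_data(n: int, d: int = 30, seed: int = 0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))           # placeholder Gaussian inputs
    y = 2 * rng.integers(0, 2, size=n) - 1    # placeholder +/-1 labels
    return X, y

X_train, y_train = make_data(500, seed=0)     # training set: 500 points, dim 30
X_test, y_test = make_data(5000, seed=1)      # test set: 5,000 points; no validation split
```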