Special Properties of Gradient Descent with Large Learning Rates

Authors: Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining best performance. We demonstrate the same effects also in the noise-less case, i.e. for full-batch GD. We formally prove that GD with large step size on certain non-convex function classes follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one.
Researcher Affiliation | Academia | EPFL, Switzerland; CISPA, Germany. Correspondence to: Amirkeivan Mohtashami <amirkeivan.mohtashami@epfl.ch>.
Pseudocode | Yes | For completeness, we provide a pseudo code in the Appendix A, Algorithm 1.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for their methodology is made publicly available.
Open Datasets | Yes | In our experiments we train a ResNet-18 (He et al., 2016) without batch normalization on CIFAR10 (Krizhevsky & Hinton, 2009) dataset.
Dataset Splits | No | The paper uses CIFAR10 and CIFAR100 datasets and evaluates 'Test Accuracy' and 'Train Accuracy', but it does not explicitly provide the specific percentages or counts for training, validation, and test dataset splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or their version numbers (e.g., programming languages, libraries, or frameworks with versions) used for the experiments.
Experiment Setup | Yes | We apply 0.0005 weight decay, 0.9 momentum, and decay the learning rate at epochs 80, 120, and 160 by 0.1. When training with standard SGD and learning rate 0.001 we train the model for 10 times more epochs (2000 epochs).
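The Research Type row quotes the paper's central claim: full-batch GD with a large step size can follow a different trajectory than GD with a small step size and end up at a global rather than a local minimum. The toy sketch below illustrates that mechanism on a hypothetical 1-D landscape of our own choosing, not taken from the paper: a sharp local minimum sits next to a wide global one, and by the usual stability heuristic GD with step size lr can only settle in a minimum whose curvature is below 2/lr, so the large step size is expelled from the sharp basin but remains stable in the wide one. The function, starting point, and step sizes are illustrative assumptions.

```python
# Toy illustration (not the paper's construction): a sharp local minimum at x = 0
# (curvature 20, value 0) next to a wide global minimum at x = 4 (curvature 1, value -1).
def f(x):
    return min(10 * x**2, 0.5 * (x - 4) ** 2 - 1.0)

def grad(x):
    # Gradient of whichever branch of the pointwise minimum is active.
    if 10 * x**2 <= 0.5 * (x - 4) ** 2 - 1.0:
        return 20 * x
    return x - 4

def gd(x0, lr, steps=200):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x0 = 0.1                    # start inside the sharp local basin
x_small = gd(x0, lr=0.04)   # 0.04 < 2/20: stable at the sharp minimum, stays local
x_large = gd(x0, lr=0.15)   # 0.15 > 2/20 but < 2/1: bounces out of the sharp basin
                            # and settles in the wide global one
print(f"small lr -> x = {x_small:.3f}, f(x) = {f(x_small):.3f}")  # ~0.000, f = 0
print(f"large lr -> x = {x_large:.3f}, f(x) = {f(x_large):.3f}")  # ~4.000, f = -1
```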
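The Experiment Setup row reports 0.0005 weight decay, 0.9 momentum, and a learning-rate decay by 0.1 at epochs 80, 120, and 160. A minimal PyTorch sketch of that configuration follows, assuming a standard training loop. The model and data loader are placeholders (the paper trains a ResNet-18 without batch normalization on CIFAR10, not reconstructed here), the initial learning rate of 0.1 is an assumed value for the large-lr regime (only the 0.001 small-lr baseline is quoted), and the 200-epoch budget is inferred from the quoted "10 times more epochs (2000 epochs)" for that baseline.

```python
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

# Placeholders: the paper's ResNet-18 (without batch norm) and CIFAR10 pipeline
# are not reconstructed here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_loader = []  # stand-in for a CIFAR10 DataLoader

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,             # assumed large learning rate; 0.001 is the quoted small-lr baseline
    momentum=0.9,       # as reported
    weight_decay=5e-4,  # as reported (0.0005)
)
# Multiply the learning rate by 0.1 at epochs 80, 120, and 160, as reported.
scheduler = MultiStepLR(optimizer, milestones=[80, 120, 160], gamma=0.1)

for epoch in range(200):  # the lr = 0.001 baseline instead runs for 2000 epochs
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```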