Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is strictly weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
Researcher Affiliation | Academia | Jingzhao Zhang, Tianxing He, Suvrit Sra & Ali Jadbabaie, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; {jzhzhang, tianxing, suvrit, jadbabai}@mit.edu
Pseudocode | No | The algorithms are described by mathematical equations (4), (5), and (6), but not in a formal pseudocode or algorithm block. A hedged sketch of these update rules is included after the table.
Open Source Code | Yes | Part of the code is available at https://github.com/JingzhaoZhang/why-clipping-accelerates
Open Datasets | Yes | We run language modeling on the Penn Treebank (PTB) (Mikolov et al., 2010) dataset... We train ResNet20 (He et al., 2016) on the Cifar10 dataset (Krizhevsky and Hinton, 2009).
Dataset Splits | Yes | It has a vocabulary of size 10k, and 887k/70k/78k words for training/validation/testing.
Hardware Specification | No | The paper does not specify the hardware (GPU model, CPU, memory) used to run the experiments.
Software Dependencies | No | The paper mentions specific models such as AWD-LSTM and ResNet20, but it does not provide version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | To train the LSTM LM, we follow the training recipe of Merity et al. (2018). The model is a 3-layer LSTM LM with a hidden size of 1150 and an embedding size of 400. Dropout (Srivastava et al., 2014) at rate 0.4 and DropConnect (Wan et al., 2013) at rate 0.5 are applied. For optimization, clipped SGD with a clip value of 0.25 and a learning rate of 30 is used, and the model is trained for 500 epochs. ... Our baseline algorithm runs SGD with momentum (learning rate 0.1, momentum 0.9) for 200 epochs. We choose the weight decay to be 5e-4. The learning rate is reduced by a factor of 10 at epochs 100 and 150. A hedged PyTorch sketch of this clipped-SGD recipe follows after the table.
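
The update rules referenced in the Pseudocode row (equations (4), (5), and (6) of the paper) are not reproduced verbatim here. The following is a minimal NumPy sketch of gradient clipping and normalized gradient descent as described in the paper; the function names and the toy objective are our own illustrative choices, not the authors' code. The paper's relaxed smoothness condition lets the local smoothness grow with the gradient norm, which is the regime the toy objective below (a sum of cosh terms) is meant to mimic.

    import numpy as np

    def clipped_gd_step(x, grad, lr, clip_threshold):
        # Clipped GD: shrink the step whenever the gradient norm exceeds the
        # threshold, i.e. use step size lr * min(1, clip_threshold / ||grad||).
        g_norm = np.linalg.norm(grad)
        effective_lr = lr * min(1.0, clip_threshold / (g_norm + 1e-12))
        return x - effective_lr * grad

    def normalized_gd_step(x, grad, lr, eps=1e-12):
        # Normalized GD: step size is inversely proportional to the gradient norm.
        return x - lr * grad / (np.linalg.norm(grad) + eps)

    # Toy usage: f(x) = sum(cosh(x_i)) has local smoothness that grows with the
    # gradient norm, the setting in which the paper's analysis favors clipping.
    grad_f = lambda x: np.sinh(x)
    x = np.array([3.0, 5.0])
    for _ in range(200):
        x = clipped_gd_step(x, grad_f(x), lr=0.1, clip_threshold=1.0)
    print(x)  # approaches the minimizer at the origin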
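
Similarly, the clipped-SGD recipe quoted in the Experiment Setup row (clip value 0.25, learning rate 30, 500 epochs) could be reproduced along the lines of the sketch below. This is a hedged illustration, not the authors' released training script: model, train_loader, and loss_fn are hypothetical placeholders standing in for the AWD-LSTM, the PTB data pipeline, and the language-modeling loss from Merity et al. (2018).

    import torch

    def train_clipped_sgd(model, train_loader, loss_fn,
                          epochs=500, lr=30.0, clip_value=0.25):
        # Plain SGD with the learning rate from the quoted recipe.
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for inputs, targets in train_loader:
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                # Rescale gradients so their global L2 norm is at most clip_value,
                # matching the clip value of 0.25 in the quoted setup.
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
                optimizer.step()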