A Loss Curvature Perspective on Training Instabilities of Deep Learning Models

Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, Orhan Firat

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we design a series of large scale experiments studying the evolution of the loss sharpness as we vary the learning rate, warmup period, initialization, and architectural choices.
Researcher Affiliation | Industry | Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George E. Dahl, Zack Nado, Orhan Firat. Correspondence to {gilmer, ghorbani}@google.com.
Pseudocode | No | The paper describes methods such as the Lanczos method for Hessian estimation but does not include any formal pseudocode blocks or algorithms (a minimal sketch of this approach is included after the table).
Open Source Code | No | The paper references and uses "the open sourced code of Zhu et al. (2021)" and "the Gradinit codebase" from others, including a link to https://github.com/zhuchen03/gradinit. However, it does not state that the authors' own code for the methodology described in this paper is open-sourced, nor does it provide a link to it.
Open Datasets | Yes | We investigate models trained on several benchmarks: CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) for image classification, LM1B (Chelba et al., 2013) for Language Modeling, and WMT for Neural Machine Translation (NMT).
Dataset Splits | Yes | The NMT models are trained on the WMT 16 EN-DE training set, tuned for hyper-parameters on the WMT 16 EN-DE validation set, and evaluated on the WMT 14 EN-DE test set for BLEU scores.
Hardware Specification | Yes | Nearly all experiments utilized the Google Cloud Platform with v2 cloud TPUs except for the following: the Figure 2 ResNet-50 and Stride-(1,1) DenseNet experiments utilized the v3 cloud TPU, while the GradInit code was run on a cloud machine with a single V100 GPU. The Figure 2 experiments were done in parallel using up to 50 v2 TPUs concurrently over the period of a few days. Additionally, all the Machine Translation models were trained on v3 cloud TPUs.
Software Dependencies | No | The paper mentions optimizers like SGD and Adam, and methods like Lanczos, but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version, CUDA version).
Experiment Setup | Yes | Each model is trained with various learning rates using cosine decay (unless mentioned explicitly). For warmup experiments we use linear warmup, which starts at 0 and scales linearly to a max value η before applying cosine decay (a sketch of this warmup-plus-cosine schedule appears after the table). All the models are trained for 60 epochs at a batch size of 1024 for Transformer-Base models and a batch size of 512 for Transformer-Big models. We use dropout of 0.1, label smoothing of 0.1, and no weight decay for all these models. The ResNet-50 (w/o BN) architecture was trained for 100 epochs at batch size 512, with l2 regularization of 5e-5 and dropout of 0.3. It was trained with SGD with Nesterov momentum of 0.9 and learning rate of 0.2. We applied gradient clipping at global l2 norm of 5 and used linear learning rate warmup with a warmup period of 1000 steps.
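As the Pseudocode row notes, the paper relies on the Lanczos method to estimate the loss sharpness (the largest Hessian eigenvalue) but gives no formal algorithm. The following is a minimal, hedged sketch of that idea, not the authors' code: it assumes a user-supplied hessian_vector_product callable (for example, built via autodiff of the training loss) and omits the full reorthogonalization that production implementations typically add.

# Sketch: estimate the largest Hessian eigenvalue ("sharpness") with Lanczos,
# given only Hessian-vector products. Illustrative, not the authors' code.
import numpy as np

def lanczos_top_eigenvalue(hessian_vector_product, dim, num_iters=30, seed=0):
    """Approximate the largest eigenvalue of the loss Hessian.

    hessian_vector_product: callable mapping a length-`dim` vector v to H @ v.
    dim: number of (flattened) model parameters.
    num_iters: number of Lanczos iterations (Krylov subspace size).
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)

    alphas, betas = [], []
    v_prev = np.zeros(dim)
    beta = 0.0
    for _ in range(num_iters):
        w = hessian_vector_product(v)
        alpha = float(np.dot(w, v))
        w = w - alpha * v - beta * v_prev
        beta = float(np.linalg.norm(w))
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-10:  # Krylov subspace exhausted
            break
        v_prev, v = v, w / beta

    # Eigenvalues of the small tridiagonal matrix approximate the extreme
    # eigenvalues of the full Hessian; return the largest one.
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return float(np.linalg.eigvalsh(T).max())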
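The Experiment Setup row describes linear warmup to a peak learning rate followed by cosine decay, plus gradient clipping at a global l2 norm of 5. The sketch below shows one common way to implement that schedule and the clipping step; it is illustrative only, and the function and parameter names are assumptions rather than the authors' API.

# Sketch: linear warmup from 0 to a peak learning rate, then cosine decay,
# plus global-norm gradient clipping. Names are hypothetical.
import math
import numpy as np

def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay towards 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint l2 norm is <= max_norm."""
    global_norm = math.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

# Example: the ResNet-50 (w/o BN) setup uses peak_lr=0.2 and 1000 warmup steps.
# lr = learning_rate(step, peak_lr=0.2, warmup_steps=1000, total_steps=total_steps)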