A Loss Curvature Perspective on Training Instabilities of Deep Learning Models

Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, Orhan Firat

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we design a series of large scale experiments studying the evolution of the loss sharpness as we vary the learning rate, warmup period, initialization, and architectural choices.
Researcher Affiliation | Industry | Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George E. Dahl, Zack Nado, Orhan Firat. Correspondence to {gilmer, ghorbani}@google.com.
Pseudocode | No | The paper describes methods such as the Lanczos method for Hessian estimation but does not include any formal pseudocode blocks or algorithms (a minimal sketch of this approach is included after the table).
Open Source Code | No | The paper references and uses "the open sourced code of Zhu et al. (2021)" and "the Gradinit codebase" from others, including a link to https://github.com/zhuchen03/gradinit. However, it does not state that the authors' own code for the methodology described in this paper is open-sourced, nor does it provide a link to it.
Open Datasets | Yes | We investigate models trained on several benchmarks: CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) for image classification, LM1B (Chelba et al., 2013) for Language Modeling, and WMT for Neural Machine Translation (NMT).
Dataset Splits | Yes | The NMT models are trained on the WMT 16 EN-DE training set, tuned for hyper-parameters on the WMT 16 EN-DE validation set, and evaluated on the WMT 14 EN-DE test set for BLEU scores.
Hardware Specification | Yes | Nearly all experiments utilized the Google Cloud Platform with v2 cloud TPUs except for the following: the Figure 2 ResNet-50 and Stride-(1,1) DenseNet experiments utilized the v3 cloud TPU, while the GradInit code was run on a cloud machine with a single V100 GPU. The Figure 2 experiments were done in parallel using up to 50 v2 TPUs concurrently over the period of a few days. Additionally, all the Machine Translation models were trained on v3 cloud TPUs.
Software Dependencies | No | The paper mentions optimizers like SGD and Adam, and methods like Lanczos, but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version, CUDA version).
Experiment Setup | Yes | Each model is trained with various learning rates using cosine decay (unless mentioned explicitly). For warmup experiments we use linear warmup, which starts at 0 and scales linearly to a max value η before applying cosine decay (a sketch of this warmup-plus-cosine schedule appears after the table). All the models are trained for 60 epochs at a batch size of 1024 for Transformer-Base models and a batch size of 512 for Transformer-Big models. We use dropout of 0.1, label smoothing of 0.1, and no weight decay for all these models. The ResNet-50 (w/o BN) architecture was trained for 100 epochs at batch size 512, with l2 regularization of 5e-5 and dropout of 0.3. It was trained with SGD with Nesterov momentum of 0.9 and learning rate of 0.2. We applied gradient clipping at global l2 norm of 5 and used linear learning rate warmup with a warmup period of 1000 steps.
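As the Pseudocode row notes, the paper relies on the Lanczos method to estimate the loss sharpness (the largest Hessian eigenvalue) but gives no formal algorithm. The following is a minimal, hedged sketch of that idea, not the authors' code: it assumes a user-supplied hessian_vector_product callable (for example, built via autodiff of the training loss) and omits the full reorthogonalization that production implementations typically add.

# Sketch: estimate the largest Hessian eigenvalue ("sharpness") with Lanczos,
# given only Hessian-vector products. Illustrative, not the authors' code.
import numpy as np

def lanczos_top_eigenvalue(hessian_vector_product, dim, num_iters=30, seed=0):
    """Approximate the largest eigenvalue of the loss Hessian.

    hessian_vector_product: callable mapping a length-`dim` vector v to H @ v.
    dim: number of (flattened) model parameters.
    num_iters: number of Lanczos iterations (Krylov subspace size).
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)

    alphas, betas = [], []
    v_prev = np.zeros(dim)
    beta = 0.0
    for _ in range(num_iters):
        w = hessian_vector_product(v)
        alpha = float(np.dot(w, v))
        w = w - alpha * v - beta * v_prev
        beta = float(np.linalg.norm(w))
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-10:  # Krylov subspace exhausted
            break
        v_prev, v = v, w / beta

    # Eigenvalues of the small tridiagonal matrix approximate the extreme
    # eigenvalues of the full Hessian; return the largest one.
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return float(np.linalg.eigvalsh(T).max())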
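The Experiment Setup row describes linear warmup to a peak learning rate followed by cosine decay, plus gradient clipping at a global l2 norm of 5. The sketch below shows one common way to implement that schedule and the clipping step; it is illustrative only, and the function and parameter names are assumptions rather than the authors' API.

# Sketch: linear warmup from 0 to a peak learning rate, then cosine decay,
# plus global-norm gradient clipping. Names are hypothetical.
import math
import numpy as np

def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay towards 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint l2 norm is <= max_norm."""
    global_norm = math.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

# Example: the ResNet-50 (w/o BN) setup uses peak_lr=0.2 and 1000 warmup steps.
# lr = learning_rate(step, peak_lr=0.2, warmup_steps=1000, total_steps=total_steps)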