Extrapolation for Large-batch Training in Deep Learning

Authors: Tao Lin, Lingjing Kong, Sebastian Stich, Martin Jaggi

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy. ... Our main contributions can be summarized as follows: We propose EXTRAP-SGD and extend it to a unified framework (extrapolated SGD) for distributed large-batch training. Extensive empirical results on three benchmarking tasks justify the effects of accelerated optimization and better generalization. We provide convergence analysis for methods in the proposed framework, as well as the SOTA large batch training method (i.e. mini-batch SGD with Nesterov momentum). Our analysis explains the large batch optimization inefficiency (diminishing linear speedup) observed in previous empirical work.
Researcher Affiliation | Academia | EPFL, Lausanne, Switzerland.
Pseudocode | Yes | Algorithm 1 EXTRAP-SGD (an illustrative single-worker sketch of the extrapolation step appears after this table).
Open Source Code | No | The paper does not provide explicit statements or links indicating the availability of source code for the described methodology.
Open Datasets | Yes | We evaluate all methods on the following three tasks: (1) Image Classification for CIFAR10/100 (Krizhevsky & Hinton, 2009) (50K training samples and 10K testing samples with 10/100 classes)... (2) Language Modeling for WikiText-2 (Merity et al., 2016)... and (3) Neural Machine Translation for Multi30k (Elliott et al., 2016).
Dataset Splits | Yes | Language Modeling for WikiText-2 (Merity et al., 2016) (the vocabulary size is 33K, and its train and validation sets have 2 million tokens and 217K tokens, respectively).
Hardware Specification | No | The paper mentions running experiments on a certain number of 'workers' (e.g., K=32, 64) and mentions 'NVIDIA apex' but does not specify exact hardware models such as GPU or CPU types.
Software Dependencies | No | The paper mentions 'PyTorch' and 'NVIDIA apex' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For experiments on image classification and language modeling, unless mentioned otherwise the models are trained for 300 epochs; the local mini-batch sizes are set to 256 and 64 respectively. ... The learning rate is always gradually warmed up from a relatively small value for the first few epochs. Besides, the learning rate γ in the image classification task will be dropped by a factor of 10 when the model has accessed 50% and 75% of the total number of training samples.
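The Pseudocode row above refers to Algorithm 1 (EXTRAP-SGD). Below is a minimal single-worker sketch of the extra-gradient idea that underlies it: take a trial step from the current iterate, evaluate the gradient at that extrapolated point, then apply it to the original iterate. This is a hedged illustration, not the paper's exact distributed Algorithm 1; the function name `extrap_sgd_step`, the two step-size arguments, and the use of two separate mini-batches are assumptions made here for clarity.

```python
import torch


def extrap_sgd_step(model, loss_fn, batch_a, batch_b,
                    gamma_extrap=0.01, gamma_update=0.1):
    """One illustrative extrapolation + update step (hypothetical helper).

    batch_a / batch_b: two (inputs, targets) mini-batches; using separate
    batches for the extrapolation and the update is an assumption here.
    gamma_extrap / gamma_update: illustrative names for the two step sizes.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    snapshot = [p.detach().clone() for p in params]  # remember x_t

    # 1) Extrapolation: take a trial SGD step from x_t to reach the
    #    extrapolated point.
    model.zero_grad()
    inputs, targets = batch_a
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p in params:
            p -= gamma_extrap * p.grad

    # 2) Compute the stochastic gradient at the extrapolated point.
    model.zero_grad()
    inputs, targets = batch_b
    loss_fn(model(inputs), targets).backward()

    # 3) Update: apply that gradient to the ORIGINAL iterate x_t.
    with torch.no_grad():
        for p, x_t in zip(params, snapshot):
            p.copy_(x_t - gamma_update * p.grad)
```

In the distributed large-batch setting the paper targets, each of the K workers would contribute gradients that are averaged before the update; that machinery, along with momentum, is omitted from this sketch.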
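The Experiment Setup row describes warmup followed by 10x learning-rate drops at 50% and 75% of training. The sketch below implements that schedule under stated assumptions: a linear warmup, a base learning rate of 0.1, and 5 warmup epochs are illustrative choices, not values taken from the paper.

```python
def lr_at_epoch(epoch, total_epochs=300, base_lr=0.1, warmup_epochs=5):
    """Illustrative schedule: linear warmup, then step decay at 50% and 75%."""
    if epoch < warmup_epochs:
        # Gradually warm up from a relatively small value to the base LR.
        return base_lr * (epoch + 1) / warmup_epochs
    if epoch >= 0.75 * total_epochs:
        return base_lr / 100.0   # second 10x drop at 75% of training
    if epoch >= 0.5 * total_epochs:
        return base_lr / 10.0    # first 10x drop at 50% of training
    return base_lr
```

With 300 training epochs, this places the two drops at epochs 150 and 225.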