Extrapolation for Large-batch Training in Deep Learning
Authors: Tao Lin, Lingjing Kong, Sebastian Stich, Martin Jaggi
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy. ... Our main contributions can be summarized as follows: We propose EXTRAP-SGD and extend it to a unified framework (extrapolated SGD) for distributed large-batch training. Extensive empirical results on three benchmarking tasks justify the effects of accelerated optimization and better generalization. We provide convergence analysis for methods in the proposed framework, as well as the SOTA large batch training method (i.e. mini-batch SGD with Nesterov momentum). Our analysis explains the large batch optimization inefficiency (diminishing linear speedup) observed in previous empirical work. |
| Researcher Affiliation | Academia | 1EPFL, Lausanne, Switzerland. |
| Pseudocode | Yes | Algorithm 1 EXTRAP-SGD |
| Open Source Code | No | The paper does not provide explicit statements or links indicating the availability of source code for the described methodology. |
| Open Datasets | Yes | We evaluate all methods on the following three tasks: (1) Image Classification for CIFAR10/100 (Krizhevsky & Hinton, 2009) (50K training samples and 10K testing samples with 10/100 classes)... (2) Language Modeling for WikiText-2 (Merity et al., 2016)... and (3) Neural Machine Translation for Multi30k (Elliott et al., 2016). |
| Dataset Splits | Yes | Language Modeling for WikiText-2 (Merity et al., 2016) (the vocabulary size is 33K, and its train and validation set have 2 million tokens and 217K tokens respectively) |
| Hardware Specification | No | The paper mentions running experiments on a certain number of 'workers' (e.g., K=32, 64) and mentions 'NVIDIA apex' but does not specify exact hardware models such as GPU or CPU types. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'NVIDIA apex' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For experiments on image classification and language modeling, unless mentioned otherwise the models are trained for 300 epochs; the local mini-batch sizes are set to 256 and 64 respectively. ... The learning rate is always gradually warmed up from a relatively small value for the first few epochs. Besides, the learning rate γ in image classification task will be dropped by a factor of 10 when the model has accessed 50% and 75% of the total number of training samples. |
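
The Pseudocode row above cites "Algorithm 1 EXTRAP-SGD", which is not reproduced here. As a rough, single-worker sketch of the extrapolation (extragradient-style) idea the method builds on, the snippet below evaluates a gradient at a look-ahead point and then applies that gradient at the current iterate. The step sizes, function names, and toy loss are illustrative assumptions, not the paper's distributed Algorithm 1.

```python
import torch

def extrapolated_sgd_step(params, loss_fn, batch1, batch2, lr_extra=0.05, lr=0.1):
    """One extragradient-style step: compute the gradient at an extrapolated
    (look-ahead) point, then update the original iterate with that gradient.
    Illustrative sketch only; not the paper's distributed Algorithm 1."""
    # 1) Gradient at the current point on one mini-batch.
    loss = loss_fn(params, batch1)
    g_current = torch.autograd.grad(loss, params)

    # 2) Extrapolated point reached with the (smaller) extrapolation step size.
    extrapolated = [p - lr_extra * g for p, g in zip(params, g_current)]

    # 3) Gradient at the extrapolated point, here on a fresh mini-batch.
    loss_extra = loss_fn(extrapolated, batch2)
    g_extra = torch.autograd.grad(loss_extra, extrapolated)

    # 4) Apply the extrapolated gradient from the original iterate.
    with torch.no_grad():
        for p, g in zip(params, g_extra):
            p -= lr * g

# Toy usage on least squares: params = [w], loss = mean((X w - y)^2).
w = torch.zeros(5, requires_grad=True)
X, y = torch.randn(32, 5), torch.randn(32)
loss_fn = lambda ps, batch: ((batch[0] @ ps[0] - batch[1]) ** 2).mean()
extrapolated_sgd_step([w], loss_fn, (X, y), (X, y))
```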
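
The Open Datasets and Dataset Splits rows list standard public benchmarks. A minimal sketch for obtaining one of them, CIFAR-10 via torchvision, which ships with the 50K/10K train/test split quoted above; the normalization constants and loader settings are illustrative choices, not values taken from the paper.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    # Commonly used CIFAR-10 channel statistics (illustrative choice).
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# CIFAR-10 provides the 50K-train / 10K-test split cited in the table.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Local mini-batch size of 256, matching the paper's image-classification setup.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
```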
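
The Experiment Setup row describes a gradual warmup followed by dividing the learning rate by 10 once 50% and 75% of the training samples have been processed. A minimal sketch of such a schedule, assuming a linear warmup over the first few epochs; the warmup length and starting fraction below are assumptions, since the paper states only that warmup starts "from a relatively small value for the first few epochs".

```python
def lr_at_epoch(epoch, base_lr, total_epochs=300, warmup_epochs=5, warmup_start_frac=0.1):
    """Warmup followed by step decay at 50% and 75% of training.
    Warmup length and starting fraction are assumed, not taken from the paper."""
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start_frac * base_lr up to base_lr.
        frac = epoch / warmup_epochs
        return base_lr * (warmup_start_frac + (1.0 - warmup_start_frac) * frac)
    if epoch >= 0.75 * total_epochs:
        return base_lr / 100.0  # after the second factor-of-10 drop
    if epoch >= 0.5 * total_epochs:
        return base_lr / 10.0   # after the first factor-of-10 drop
    return base_lr

# e.g. with base_lr = 0.1: warmup early on, 0.1 mid-training, 0.01 after epoch 150, 0.001 after epoch 225
print([round(lr_at_epoch(e, 0.1), 4) for e in (0, 3, 100, 160, 250)])
```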