Extrapolation for Large-batch Training in Deep Learning
Authors: Tao Lin, Lingjing Kong, Sebastian Stich, Martin Jaggi
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy. ... Our main contributions can be summarized as follows: We propose EXTRAP-SGD and extend it to a unified framework (extrapolated SGD) for distributed large-batch training. Extensive empirical results on three benchmarking tasks justify the effects of accelerated optimization and better generalization. We provide convergence analysis for methods in the proposed framework, as well as the SOTA large batch training method (i.e. mini-batch SGD with Nesterov momentum). Our analysis explains the large batch optimization inefficiency (diminishing linear speedup) observed in previous empirical work. |
| Researcher Affiliation | Academia | 1EPFL, Lausanne, Switzerland. |
| Pseudocode | Yes | Algorithm 1 EXTRAP-SGD |
| Open Source Code | No | The paper does not provide explicit statements or links indicating the availability of source code for the described methodology. |
| Open Datasets | Yes | We evaluate all methods on the following three tasks: (1) Image Classification for CIFAR10/100 (Krizhevsky & Hinton, 2009) (50K training samples and 10K testing samples with 10/100 classes)... (2) Language Modeling for WikiText-2 (Merity et al., 2016)... and (3) Neural Machine Translation for Multi30k (Elliott et al., 2016). |
| Dataset Splits | Yes | Language Modeling for WikiText-2 (Merity et al., 2016) (the vocabulary size is 33K, and its train and validation set have 2 million tokens and 217K tokens respectively) |
| Hardware Specification | No | The paper mentions running experiments on a certain number of 'workers' (e.g., K=32, 64) and mentions 'NVIDIA apex' but does not specify exact hardware models such as GPU or CPU types. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'NVIDIA apex' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For experiments on image classification and language modeling, unless mentioned otherwise the models are trained for 300 epochs; the local mini-batch sizes are set to 256 and 64 respectively. ... The learning rate is always gradually warmed up from a relatively small value for the first few epochs. Besides, the learning rate γ in image classification task will be dropped by a factor of 10 when the model has accessed 50% and 75% of the total number of training samples. |
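
The Pseudocode row above cites "Algorithm 1 EXTRAP-SGD", which is not reproduced here. As a rough, single-worker sketch of the extrapolation (extragradient-style) idea the method builds on, the snippet below evaluates a gradient at a look-ahead point and then applies that gradient at the current iterate. The step sizes, function names, and toy loss are illustrative assumptions, not the paper's distributed Algorithm 1.

```python
import torch

def extrapolated_sgd_step(params, loss_fn, batch1, batch2, lr_extra=0.05, lr=0.1):
    """One extragradient-style step: compute the gradient at an extrapolated
    (look-ahead) point, then update the original iterate with that gradient.
    Illustrative sketch only; not the paper's distributed Algorithm 1."""
    # 1) Gradient at the current point on one mini-batch.
    loss = loss_fn(params, batch1)
    g_current = torch.autograd.grad(loss, params)

    # 2) Extrapolated point reached with the (smaller) extrapolation step size.
    extrapolated = [p - lr_extra * g for p, g in zip(params, g_current)]

    # 3) Gradient at the extrapolated point, here on a fresh mini-batch.
    loss_extra = loss_fn(extrapolated, batch2)
    g_extra = torch.autograd.grad(loss_extra, extrapolated)

    # 4) Apply the extrapolated gradient from the original iterate.
    with torch.no_grad():
        for p, g in zip(params, g_extra):
            p -= lr * g

# Toy usage on least squares: params = [w], loss = mean((X w - y)^2).
w = torch.zeros(5, requires_grad=True)
X, y = torch.randn(32, 5), torch.randn(32)
loss_fn = lambda ps, batch: ((batch[0] @ ps[0] - batch[1]) ** 2).mean()
extrapolated_sgd_step([w], loss_fn, (X, y), (X, y))
```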
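
The Open Datasets and Dataset Splits rows list standard public benchmarks. A minimal sketch for obtaining one of them, CIFAR-10 via torchvision, which ships with the 50K/10K train/test split quoted above; the normalization constants and loader settings are illustrative choices, not values taken from the paper.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    # Commonly used CIFAR-10 channel statistics (illustrative choice).
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# CIFAR-10 provides the 50K-train / 10K-test split cited in the table.
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Local mini-batch size of 256, matching the paper's image-classification setup.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)
```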
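
The Experiment Setup row describes a gradual warmup followed by dividing the learning rate by 10 once 50% and 75% of the training samples have been processed. A minimal sketch of such a schedule, assuming a linear warmup over the first few epochs; the warmup length and starting fraction below are assumptions, since the paper states only that warmup starts "from a relatively small value for the first few epochs".

```python
def lr_at_epoch(epoch, base_lr, total_epochs=300, warmup_epochs=5, warmup_start_frac=0.1):
    """Warmup followed by step decay at 50% and 75% of training.
    Warmup length and starting fraction are assumed, not taken from the paper."""
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start_frac * base_lr up to base_lr.
        frac = epoch / warmup_epochs
        return base_lr * (warmup_start_frac + (1.0 - warmup_start_frac) * frac)
    if epoch >= 0.75 * total_epochs:
        return base_lr / 100.0  # after the second factor-of-10 drop
    if epoch >= 0.5 * total_epochs:
        return base_lr / 10.0   # after the first factor-of-10 drop
    return base_lr

# e.g. with base_lr = 0.1: warmup early on, 0.1 mid-training, 0.01 after epoch 150, 0.001 after epoch 225
print([round(lr_at_epoch(e, 0.1), 4) for e in (0, 3, 100, 160, 250)])
```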