Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning.
Researcher Affiliation | Collaboration | Yang You (2), Jing Li (1), Sashank Reddi (1), Jonathan Hseu (1), Sanjiv Kumar (1), Srinadh Bhojanapalli (1), Xiaodan Song (1), James Demmel (2), Kurt Keutzer (2), Cho-Jui Hsieh (1,3); affiliations: 1 Google, 2 UC Berkeley, 3 UCLA
Pseudocode | Yes | Algorithm 1 (LARS); Algorithm 2 (LAMB)
Open Source Code | Yes | The LAMB implementation is available online: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py (a hedged usage sketch of this optimizer follows the table).
Open Datasets | Yes | For this experiment, we use the same dataset as Devlin et al. (2018), which is a concatenation of Wikipedia and Books Corpus with 2.5B and 800M words respectively.; ImageNet training with ResNet-50 is an industry standard metric that is being used in MLPerf.
Dataset Splits | Yes | We specifically focus on the SQuAD task in this paper. The F1 score on SQuAD-v1 is used as the accuracy metric in our experiments.; Table 3: Top-1 validation accuracy of ImageNet/ResNet-50 training at the batch size of 16K (90 epochs).
Hardware Specification | Yes | We use TPUv3 in all the experiments. A TPUv3 Pod has 1024 chips and can provide more than 100 petaflops performance for mixed precision computing.
Software Dependencies | No | The paper mentions software components such as TensorFlow (indirectly, through the GitHub link) and optimizers (ADAM, ADAGRAD, ADAMW, LARS, LAMB) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | The parameters β1 and β2 in Algorithm 2 are set to 0.9 and 0.999 respectively in all our experiments; we only tune the learning rate. We use a polynomially decaying learning rate of η_t = η_0 (1 − t/T) in Algorithm 2, which is the same as in the BERT baseline. This setting also works for all other applications in this paper. (A sketch of this schedule and the LAMB update follows the table.)
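
The experiment-setup row above quotes the LAMB hyperparameters (β1 = 0.9, β2 = 0.999, a tuned learning rate, and the polynomial decay η_t = η_0 (1 − t/T)). Below is a minimal NumPy sketch of a LAMB-style layerwise update consistent with Algorithm 2 as described in the paper; the scaling function φ is taken as the identity, and the epsilon, weight-decay value, and all function and variable names are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a LAMB-style update for one layer's parameters.
# Assumptions (not from the paper's released code): phi is the identity,
# eps and weight_decay defaults are illustrative, step counting starts at 1.
import numpy as np

def poly_decay_lr(eta0, step, total_steps):
    """Polynomially decaying learning rate eta_t = eta_0 * (1 - t/T)."""
    return eta0 * (1.0 - step / total_steps)

def lamb_step(param, grad, m, v, step, total_steps,
              eta0=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # Adam-like direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param

    # Layerwise trust ratio ||w|| / ||update|| (phi taken as identity here).
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    param = param - poly_decay_lr(eta0, step, total_steps) * trust_ratio * update
    return param, m, v
```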
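
The open-source-code row links the TensorFlow Addons LAMB optimizer. The snippet below is a hedged usage sketch, not the paper's training script: the keyword names follow the tfa.optimizers.LAMB API as commonly documented and should be checked against the linked lamb.py, and the model, loss, and weight-decay value are placeholders.

```python
# Hedged usage sketch of the linked TensorFlow Addons LAMB optimizer.
# Keyword names should be verified against the linked lamb.py source;
# the model and hyperparameter values below are placeholders.
import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
optimizer = tfa.optimizers.LAMB(
    learning_rate=1e-3,      # the paper tunes only the learning rate
    beta_1=0.9,              # beta_1 from the experiment-setup row above
    beta_2=0.999,            # beta_2 from the experiment-setup row above
    epsilon=1e-6,
    weight_decay_rate=0.01,  # illustrative value, not taken from the paper
)
model.compile(optimizer=optimizer, loss="mse")
```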