Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. |
| Researcher Affiliation | Collaboration | Yang You², Jing Li¹, Sashank Reddi¹, Jonathan Hseu¹, Sanjiv Kumar¹, Srinadh Bhojanapalli¹, Xiaodan Song¹, James Demmel², Kurt Keutzer², Cho-Jui Hsieh¹,³ (¹Google, ²UC Berkeley, ³UCLA) |
| Pseudocode | Yes | Algorithm 1 (LARS); Algorithm 2 (LAMB). A minimal sketch of the LAMB update rule follows the table. |
| Open Source Code | Yes | The LAMB implementation is available online: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py (a usage sketch follows the table). |
| Open Datasets | Yes | For this experiment, we use the same dataset as Devlin et al. (2018), which is a concatenation of Wikipedia and BooksCorpus with 2.5B and 800M words respectively.; ImageNet training with ResNet-50 is an industry standard metric that is being used in MLPerf. |
| Dataset Splits | Yes | We specifically focus on the SQuAD task in this paper. The F1 score on SQuAD-v1 is used as the accuracy metric in our experiments.; Table 3: Top-1 validation accuracy of ImageNet/ResNet-50 training at the batch size of 16K (90 epochs). |
| Hardware Specification | Yes | We use TPUv3 in all the experiments. A TPUv3 Pod has 1024 chips and can provide more than 100 petaflops performance for mixed precision computing. |
| Software Dependencies | No | The paper mentions software components like TensorFlow (indirectly through the GitHub link) and optimizers (ADAM, ADAGRAD, ADAMW, LARS, LAMB) but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The parameters β₁ and β₂ in Algorithm 2 are set to 0.9 and 0.999 respectively in all our experiments; we only tune the learning rate. We use a polynomially decaying learning rate of ηₜ = η₀(1 − t/T) in Algorithm 2, which is the same as in the BERT baseline. This setting also works for all other applications in this paper. A sketch of this schedule follows the table. |
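
For reference, below is a minimal NumPy sketch of the LAMB update rule cited in the Pseudocode row (Algorithm 2 of the paper). It is not the authors' implementation: the layer-wise scaling function φ is taken to be the identity, and the default values for `eps` and `weight_decay` are illustrative rather than the paper's tuned settings.

```python
import numpy as np

def lamb_update(param, grad, m, v, step, lr,
                beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One LAMB step for a single layer's parameter tensor (sketch of Algorithm 2)."""
    # Adam-style first/second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # Adam-like direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param

    # Layer-wise trust ratio ||w|| / ||update||, guarding against zero norms
    # (the scaling function phi is taken as the identity here).
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    param = param - lr * trust_ratio * update
    return param, m, v
```

The trust ratio is what distinguishes LAMB from ADAMW: each layer's step size is rescaled by the ratio of the weight norm to the update norm, which the paper credits for stable training at very large batch sizes.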
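
The Open Source Code row links to the TensorFlow Addons implementation. A hedged usage sketch follows; the constructor arguments shown (`beta_1`, `beta_2`, `weight_decay_rate`, `exclude_from_weight_decay`) reflect the `tfa.optimizers.LAMB` API as commonly documented and may differ across versions, and the numeric values other than β₁/β₂ are illustrative rather than taken from the paper.

```python
# Usage sketch of the linked tensorflow_addons LAMB optimizer.
# Assumes `pip install tensorflow tensorflow-addons`.
import tensorflow as tf
import tensorflow_addons as tfa

optimizer = tfa.optimizers.LAMB(
    learning_rate=1e-3,          # eta_0; the paper tunes only this value
    beta_1=0.9,                  # from the Experiment Setup row
    beta_2=0.999,                # from the Experiment Setup row
    weight_decay_rate=0.01,      # illustrative, not a paper value
    exclude_from_weight_decay=["LayerNorm", "bias"],  # common BERT convention (assumed)
)

# Minimal wiring into a Keras model to show the optimizer is drop-in.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer=optimizer, loss="mse")
```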
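
The learning-rate schedule quoted in the Experiment Setup row is a power-1 polynomial (i.e. linear) decay. A minimal sketch, assuming `total_steps` plays the role of T:

```python
def poly_decay_lr(step, total_steps, base_lr):
    """eta_t = eta_0 * (1 - t / T): the polynomial decay quoted above."""
    return base_lr * (1.0 - step / float(total_steps))

# Example: with eta_0 = 1e-3 and T = 1000 steps, the rate halves at step 500.
assert abs(poly_decay_lr(500, 1000, 1e-3) - 5e-4) < 1e-12
```

In TensorFlow the same schedule can be expressed with `tf.keras.optimizers.schedules.PolynomialDecay(power=1.0, end_learning_rate=0.0)` and passed directly as the `learning_rate` of the optimizer sketched above.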