Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Authors: Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning.
Researcher Affiliation | Collaboration | Yang You (2), Jing Li (1), Sashank Reddi (1), Jonathan Hseu (1), Sanjiv Kumar (1), Srinadh Bhojanapalli (1), Xiaodan Song (1), James Demmel (2), Kurt Keutzer (2), Cho-Jui Hsieh (1,3); affiliations: 1 Google, 2 UC Berkeley, 3 UCLA
Pseudocode | Yes | Algorithm 1 (LARS); Algorithm 2 (LAMB)
Open Source Code | Yes | The LAMB implementation is available online: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py (a hedged usage sketch of this optimizer follows the table).
Open Datasets | Yes | For this experiment, we use the same dataset as Devlin et al. (2018), which is a concatenation of Wikipedia and Books Corpus with 2.5B and 800M words respectively.; ImageNet training with ResNet-50 is an industry standard metric that is being used in MLPerf.
Dataset Splits | Yes | We specifically focus on the SQuAD task in this paper. The F1 score on SQuAD-v1 is used as the accuracy metric in our experiments.; Table 3: Top-1 validation accuracy of ImageNet/ResNet-50 training at the batch size of 16K (90 epochs).
Hardware Specification | Yes | We use TPUv3 in all the experiments. A TPUv3 Pod has 1024 chips and can provide more than 100 petaflops performance for mixed precision computing.
Software Dependencies | No | The paper mentions software components such as TensorFlow (indirectly, through the GitHub link) and optimizers (ADAM, ADAGRAD, ADAMW, LARS, LAMB) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | The parameters β1 and β2 in Algorithm 2 are set to 0.9 and 0.999 respectively in all our experiments; we only tune the learning rate. We use a polynomially decaying learning rate of η_t = η_0 (1 − t/T) in Algorithm 2, which is the same as in the BERT baseline. This setting also works for all other applications in this paper. (A sketch of this schedule and the LAMB update follows the table.)
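
The experiment-setup row above quotes the LAMB hyperparameters (β1 = 0.9, β2 = 0.999, a tuned learning rate, and the polynomial decay η_t = η_0 (1 − t/T)). Below is a minimal NumPy sketch of a LAMB-style layerwise update consistent with Algorithm 2 as described in the paper; the scaling function φ is taken as the identity, and the epsilon, weight-decay value, and all function and variable names are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a LAMB-style update for one layer's parameters.
# Assumptions (not from the paper's released code): phi is the identity,
# eps and weight_decay defaults are illustrative, step counting starts at 1.
import numpy as np

def poly_decay_lr(eta0, step, total_steps):
    """Polynomially decaying learning rate eta_t = eta_0 * (1 - t/T)."""
    return eta0 * (1.0 - step / total_steps)

def lamb_step(param, grad, m, v, step, total_steps,
              eta0=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # Adam-like direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param

    # Layerwise trust ratio ||w|| / ||update|| (phi taken as identity here).
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    param = param - poly_decay_lr(eta0, step, total_steps) * trust_ratio * update
    return param, m, v
```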
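
The open-source-code row links the TensorFlow Addons LAMB optimizer. The snippet below is a hedged usage sketch, not the paper's training script: the keyword names follow the tfa.optimizers.LAMB API as commonly documented and should be checked against the linked lamb.py, and the model, loss, and weight-decay value are placeholders.

```python
# Hedged usage sketch of the linked TensorFlow Addons LAMB optimizer.
# Keyword names should be verified against the linked lamb.py source;
# the model and hyperparameter values below are placeholders.
import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
optimizer = tfa.optimizers.LAMB(
    learning_rate=1e-3,      # the paper tunes only the learning rate
    beta_1=0.9,              # beta_1 from the experiment-setup row above
    beta_2=0.999,            # beta_2 from the experiment-setup row above
    epsilon=1e-6,
    weight_decay_rate=0.01,  # illustrative value, not taken from the paper
)
model.compile(optimizer=optimizer, loss="mse")
```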