Efficient Training of BERT by Progressively Stacking

Authors: Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on our proposed method to see (1) whether it can improve the training efficiency and convergence rate at the pre-training step, and (2) whether the trained model can achieve similar performance compared to the baseline models.
Researcher Affiliation | Collaboration | 1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; 2 Microsoft Research; 3 Center for Data Science, Peking University, Beijing Institute of Big Data Research.
Pseudocode | Yes | Algorithm 1: Progressive stacking
Open Source Code | Yes | Codes for the experiments are available at https://github.com/gonglinyuan/StackingBERT
Open Datasets | Yes | Datasets. We follow Devlin et al. (2018) to use English Wikipedia corpus and Book Corpus for pre-training. By concatenating the two datasets, we obtain our corpus with roughly 3400M words in total... We fine-tune each pre-trained model on 9 downstream tasks in GLUE (General Language Understanding Evaluation), a system for evaluating and analyzing the performance of models across a diverse set of existing NLU tasks (Wang et al., 2018).
Dataset Splits | Yes | We randomly split documents into one training set and one validation set. The training-validation ratio for pre-training is 199:1.
Hardware Specification | Yes | To fairly compare the speed of different algorithms, we train all models in the same computation environment with 4 NVIDIA Tesla P40 GPUs.
Software Dependencies | No | The paper states, 'All of our experiments are mainly based on our own reimplementation of BERT model... using fairseq... in PyTorch toolkit.' However, it does not provide specific version numbers for PyTorch, fairseq, or other software dependencies.
Experiment Setup | Yes | The BERT-base model is trained for 400,000 updates from scratch, and the batch size for each update is set to be 122,880 tokens. For both models, we use Adam (Kingma & Ba, 2014) as the optimizer, and for our progressively stacking method, we reset the optimizer states (the first/second moment estimation of Adam) but keep the same learning rate when switching from the shallow model to the deep model.
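The paper's Algorithm 1 grows a shallow BERT encoder into a deeper one by copying the trained layer stack on top of itself. Below is a minimal PyTorch sketch of that stacking step; `TinyEncoder`, `stack`, and the `nn.Linear` stand-in layers are illustrative names for this sketch, not the authors' fairseq-based implementation.

```python
# Minimal sketch of progressive stacking (Algorithm 1), assuming a generic
# encoder whose layers live in a torch.nn.ModuleList. The real BERT encoder
# uses Transformer layers; nn.Linear is only a toy stand-in here.
import copy
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for a BERT encoder: a stack of identical layers."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def stack(encoder):
    """Double the depth by copying the trained shallow stack on top of itself."""
    bottom = [copy.deepcopy(layer) for layer in encoder.layers]
    top = [copy.deepcopy(layer) for layer in encoder.layers]
    return TinyEncoder(bottom + top)

# Usage: train a 3-layer model, stack to 6 layers, train, stack to 12, train.
shallow = TinyEncoder([nn.Linear(16, 16) for _ in range(3)])
deeper = stack(shallow)    # 6 layers, initialized from the 3-layer model
deepest = stack(deeper)    # 12 layers (BERT-base depth)
print(len(deepest.layers)) # 12
```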
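The Experiment Setup row notes that Adam's moment estimates are reset while the learning rate is kept when switching from the shallow to the stacked model. A hedged sketch of that hand-off, reusing `shallow` and `stack` from the example above; the actual learning-rate schedule and hyperparameters in the authors' code may differ.

```python
# Sketch of the optimizer hand-off: creating a fresh Adam instance discards the
# first/second moment estimates, while the learning rate is carried over.
import torch

shallow_opt = torch.optim.Adam(shallow.parameters(), lr=1e-4)
# ... pre-train the shallow model for some number of updates ...

current_lr = shallow_opt.param_groups[0]["lr"]  # keep the same learning rate
deep_model = stack(shallow)                     # grow the model by stacking
deep_opt = torch.optim.Adam(deep_model.parameters(), lr=current_lr)
# deep_opt starts with zeroed Adam state for the stacked model, matching the
# "reset the optimizer states" description in the quoted setup.
```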