Efficient Training of BERT by Progressively Stacking

Authors: Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on our proposed method to see (1) whether it can improve the training efficiency and convergence rate at the pre-training step, and (2) whether the trained model can achieve similar performance compared to the baseline models.
Researcher Affiliation | Collaboration | 1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; 2 Microsoft Research; 3 Center for Data Science, Peking University, Beijing Institute of Big Data Research.
Pseudocode | Yes | Algorithm 1: Progressive stacking
Open Source Code | Yes | Codes for the experiments are available at https://github.com/gonglinyuan/StackingBERT
Open Datasets | Yes | Datasets. We follow Devlin et al. (2018) to use English Wikipedia corpus and Book Corpus for pre-training. By concatenating the two datasets, we obtain our corpus with roughly 3400M words in total... We fine-tune each pre-trained model on 9 downstream tasks in GLUE (General Language Understanding Evaluation), a system for evaluating and analyzing the performance of models across a diverse set of existing NLU tasks (Wang et al., 2018).
Dataset Splits | Yes | We randomly split documents into one training set and one validation set. The training-validation ratio for pre-training is 199:1.
Hardware Specification | Yes | To fairly compare the speed of different algorithms, we train all models in the same computation environment with 4 NVIDIA Tesla P40 GPUs.
Software Dependencies | No | The paper states, 'All of our experiments are mainly based on our own reimplementation of BERT model... using fairseq... in PyTorch toolkit.' However, it does not provide specific version numbers for PyTorch, fairseq, or other software dependencies.
Experiment Setup | Yes | The BERT-base model is trained for 400,000 updates from scratch, and the batch size for each update is set to be 122,880 tokens. For both models, we use Adam (Kingma & Ba, 2014) as the optimizer, and for our progressively stacking method, we reset the optimizer states (the first/second moment estimation of Adam) but keep the same learning rate when switching from the shallow model to the deep model.
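The paper's Algorithm 1 grows a shallow BERT encoder into a deeper one by copying the trained layer stack on top of itself. Below is a minimal PyTorch sketch of that stacking step; `TinyEncoder`, `stack`, and the `nn.Linear` stand-in layers are illustrative names for this sketch, not the authors' fairseq-based implementation.

```python
# Minimal sketch of progressive stacking (Algorithm 1), assuming a generic
# encoder whose layers live in a torch.nn.ModuleList. The real BERT encoder
# uses Transformer layers; nn.Linear is only a toy stand-in here.
import copy
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for a BERT encoder: a stack of identical layers."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def stack(encoder):
    """Double the depth by copying the trained shallow stack on top of itself."""
    bottom = [copy.deepcopy(layer) for layer in encoder.layers]
    top = [copy.deepcopy(layer) for layer in encoder.layers]
    return TinyEncoder(bottom + top)

# Usage: train a 3-layer model, stack to 6 layers, train, stack to 12, train.
shallow = TinyEncoder([nn.Linear(16, 16) for _ in range(3)])
deeper = stack(shallow)    # 6 layers, initialized from the 3-layer model
deepest = stack(deeper)    # 12 layers (BERT-base depth)
print(len(deepest.layers)) # 12
```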
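The Experiment Setup row notes that Adam's moment estimates are reset while the learning rate is kept when switching from the shallow to the stacked model. A hedged sketch of that hand-off, reusing `shallow` and `stack` from the example above; the actual learning-rate schedule and hyperparameters in the authors' code may differ.

```python
# Sketch of the optimizer hand-off: creating a fresh Adam instance discards the
# first/second moment estimates, while the learning rate is carried over.
import torch

shallow_opt = torch.optim.Adam(shallow.parameters(), lr=1e-4)
# ... pre-train the shallow model for some number of updates ...

current_lr = shallow_opt.param_groups[0]["lr"]  # keep the same learning rate
deep_model = stack(shallow)                     # grow the model by stacking
deep_opt = torch.optim.Adam(deep_model.parameters(), lr=current_lr)
# deep_opt starts with zeroed Adam state for the stacked model, matching the
# "reset the optimizer states" description in the quoted setup.
```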