Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Authors: Minjia Zhang, Yuxiong He

Venue: NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5× faster than the baseline to get a similar accuracy on downstream tasks.
Researcher Affiliation | Industry | Minjia Zhang, Yuxiong He, Microsoft Corporation, {minjiaz,yuxhe}@microsoft.com
Pseudocode | Yes | Algorithm 1 Progressive_Layer_Dropping (a minimal sketch of the idea appears after the table).
Open Source Code | Yes | We will open-source the code so that other practitioners and researchers can reproduce our results or re-use the code in their ventures in this field.
Open Datasets | Yes | We follow Devlin et al. [3] to use the English Wikipedia corpus and Book Corpus for pre-training. By concatenating the two datasets, we obtain our corpus with roughly 2.8B word tokens in total, which is comparable with the data corpus used in Devlin et al. [3].
Dataset Splits | Yes | We split documents into one training set and one validation set (300:1). For fine-tuning, we use GLUE (General Language Understanding Evaluation), a collection of 9 sentence or sentence-pair natural language understanding tasks... (a minimal split sketch appears after the table).
Hardware Specification | Yes | All experiments are performed on 4 DGX-2 boxes with 64 V100 GPUs.
Software Dependencies | No | The paper mentions a 'PyTorch implementation' and the 'PyTorch DDP (Distributed Data Parallel) library' but does not specify exact version numbers for these software dependencies.
Experiment Setup | Yes | We use a warm-up ratio of 0.02 with lr_max = 1e-4. Following [3], we use Adam as the optimizer. We train with batch size 4K for 200K steps, which is approximately 186 epochs. The detailed parameter settings are listed in Appendix A (the warm-up schedule is sketched after the table).
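The Pseudocode row points to the paper's Algorithm 1 (Progressive_Layer_Dropping). Below is a minimal PyTorch sketch of the general idea only, assuming a keep probability that decays over training (a temporal schedule) and shrinks with depth, in the spirit of stochastic depth. The class name, schedule constants (theta_bar, gamma), and the omission of the paper's output rescaling and PreLN details are illustration-level assumptions; Algorithm 1 in the paper remains the authoritative procedure.

```python
import math
import torch
import torch.nn as nn


class ProgressiveLayerDrop(nn.Module):
    """Illustrative sketch (not the authors' released code): wraps a stack of
    transformer layers and stochastically skips layers during training.

    The keep probability combines a temporal schedule theta(t), which decays
    from 1.0 toward a floor theta_bar as training progresses, with a depth
    factor so that deeper layers are skipped more often.
    """

    def __init__(self, layers, theta_bar=0.5, gamma=1e-4):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.theta_bar = theta_bar  # assumed keep-probability floor
        self.gamma = gamma          # assumed decay rate of the schedule
        self.global_step = 0

    def theta(self):
        # Temporal schedule: 1.0 at step 0, decaying exponentially to theta_bar.
        return (1.0 - self.theta_bar) * math.exp(-self.gamma * self.global_step) + self.theta_bar

    def forward(self, x):
        theta_t = self.theta()
        num_layers = len(self.layers)
        for i, layer in enumerate(self.layers, start=1):
            # Depth-wise keep probability: shallower layers are kept more often.
            keep_prob = 1.0 - (i / num_layers) * (1.0 - theta_t)
            if self.training and torch.rand(1).item() > keep_prob:
                continue  # drop (skip) this layer for this training step
            x = layer(x)
        if self.training:
            self.global_step += 1
        return x


# Usage with stock PyTorch transformer layers (dimensions chosen arbitrarily).
layers = [nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
          for _ in range(12)]
model = ProgressiveLayerDrop(layers)
out = model(torch.randn(2, 16, 256))
```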
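The Dataset Splits row quotes a 300:1 document-level split into training and validation sets. The snippet below is only a schematic of such a split over an in-memory list of documents; the paper does not state how the held-out documents are chosen, so the shuffling, seed, and helper name are assumptions.

```python
import random


def split_documents(documents, ratio=300, seed=42):
    """Split a list of documents into train/validation at roughly ratio:1."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    n_valid = max(1, len(docs) // (ratio + 1))
    return docs[n_valid:], docs[:n_valid]  # (train, validation)


# Example: 301 toy "documents" -> ~300 train, 1 validation.
train_docs, valid_docs = split_documents([f"doc {i}" for i in range(301)])
print(len(train_docs), len(valid_docs))
```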
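The Experiment Setup row quotes a warm-up ratio of 0.02, lr_max = 1e-4, Adam, a 4K batch size, and 200K steps. The snippet below plugs those numbers into a generic linear warm-up followed by linear decay via PyTorch's LambdaLR; the post-warm-up decay shape, the optimizer defaults, and the stand-in model are assumptions, as the paper defers the full settings to its Appendix A.

```python
import torch

TOTAL_STEPS = 200_000                    # quoted: 200K steps
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)   # quoted warm-up ratio 0.02 -> 4,000 steps
LR_MAX = 1e-4                            # quoted peak learning rate

model = torch.nn.Linear(768, 768)        # stand-in module, not the BERT model
optimizer = torch.optim.Adam(model.parameters(), lr=LR_MAX)  # betas/weight decay left at defaults (assumed)


def lr_lambda(step):
    # Linear warm-up to LR_MAX, then linear decay to 0 (decay shape assumed).
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inspect the schedule at a few milestones instead of running the full loop.
for step in (0, 2_000, WARMUP_STEPS, 100_000, TOTAL_STEPS):
    print(step, LR_MAX * lr_lambda(step))
```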