Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Authors: Minjia Zhang, Yuxiong He
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5× faster than the baseline to get a similar accuracy on downstream tasks. |
| Researcher Affiliation | Industry | Minjia Zhang Yuxiong He Microsoft Corporation {minjiaz,yuxhe}@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Progressive_Layer_Dropping (a hedged schedule sketch follows the table) |
| Open Source Code | Yes | We will open-source the code so that other practitioners and researchers can reproduce our results or re-use code into their ventures in this field. |
| Open Datasets | Yes | We follow Devlin et al. [3] to use English Wikipedia corpus and Book Corpus for pretraining. By concatenating the two datasets, we obtain our corpus with roughly 2.8B word tokens in total, which is comparable with the data corpus used in Devlin et al. [3]. |
| Dataset Splits | Yes | We split documents into one training set and one validation set (300:1). For fine-tuning, we use GLUE (General Language Understanding Evaluation), a collection of 9 sentence or sentence-pair natural language understanding tasks... (an illustrative split sketch follows the table) |
| Hardware Specification | Yes | All experiments are performed on 4 DGX-2 boxes with 64 V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' and 'PyTorch DDP (Distributed Data Parallel) library' but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a warm-up ratio of 0.02 with lr_max = 1e-4. Following [3], we use Adam as the optimizer. We train with batch size 4K for 200K steps, which is approximately 186 epochs. The detailed parameter settings are listed in Appendix A. (a hedged optimizer/scheduler sketch follows the table) |
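
The pseudocode row points at Algorithm 1 (Progressive_Layer_Dropping). As a rough illustration of the idea, here is a minimal PyTorch-style sketch: a global keep probability θ(t) decays from 1.0 toward a floor θ̄, and each block's keep probability shrinks linearly with depth, as in stochastic depth. The names (`keep_probability`, `ProgressiveDropEncoder`), the constants `gamma` and `theta_bar`, and the assumption that each block computes only the residual branch F(x) are illustrative assumptions, not the authors' released implementation.

```python
import math

import torch
import torch.nn as nn


def keep_probability(step: int, layer_idx: int, num_layers: int,
                     gamma: float = 1e-4, theta_bar: float = 0.5) -> float:
    """Keep probability for layer `layer_idx` (1-based) at training `step`.

    theta(t) decays from 1.0 toward theta_bar over time; deeper layers get a
    lower keep probability, as in stochastic depth. gamma and theta_bar are
    illustrative values, not the paper's tuned constants.
    """
    theta_t = (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar
    return 1.0 - (layer_idx / num_layers) * (1.0 - theta_t)


class ProgressiveDropEncoder(nn.Module):
    """Encoder whose residual blocks are stochastically skipped during training.

    Each element of `blocks` is assumed to compute only the residual branch
    F(x) (attention + FFN without the skip connection), so the update is
    x <- x + F(x) / p when the block is kept, and x <- x when it is dropped.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.step = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        num_layers = len(self.blocks)
        for i, block in enumerate(self.blocks, start=1):
            p = keep_probability(self.step, i, num_layers)
            if self.training:
                if torch.rand(()).item() < p:
                    x = x + block(x) / p      # keep the block, rescale its output
                # otherwise skip the block entirely (identity mapping)
            else:
                x = x + block(x)              # at inference every block is applied
        if self.training:
            self.step += 1
        return x
```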
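
For the 300:1 document-level train/validation split quoted above, a minimal sketch under assumed details (random per-document assignment with a fixed seed; `split_documents` is a hypothetical helper, not the authors' preprocessing code):

```python
import random


def split_documents(documents, valid_fraction=1 / 301, seed=0):
    """Assign each document to train or validation at roughly 300:1."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for doc in documents:
        (valid if rng.random() < valid_fraction else train).append(doc)
    return train, valid
```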
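
The experiment-setup row lists Adam with lr_max = 1e-4, a warm-up ratio of 0.02, batch size 4K, and 200K training steps. A hedged sketch of such an optimizer and learning-rate schedule is below; the linear decay after warm-up and the placeholder model are assumptions, since the paper defers the detailed settings to Appendix A.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 200_000                     # "200K steps"
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)    # warm-up ratio of 0.02 -> 4,000 steps
LR_MAX = 1e-4                             # lr_max from the quoted setup

model = torch.nn.Linear(768, 768)         # placeholder; the paper pre-trains BERT
optimizer = Adam(model.parameters(), lr=LR_MAX)


def lr_lambda(step: int) -> float:
    """Linear warm-up to LR_MAX, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))


scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop skeleton: step the scheduler once per optimizer step.
# for step in range(TOTAL_STEPS):
#     loss = ...                          # forward pass on a 4K-sample batch
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     scheduler.step()
```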