Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Authors: Minjia Zhang, Yuxiong He
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on BERT show that the proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5× faster than the baseline to get a similar accuracy on downstream tasks. |
| Researcher Affiliation | Industry | Minjia Zhang Yuxiong He Microsoft Corporation {minjiaz,yuxhe}@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Progressive_Layer_Dropping (a hedged schedule sketch follows the table) |
| Open Source Code | Yes | We will open-source the code so that other practitioners and researchers can reproduce our results or re-use code into their ventures in this field. |
| Open Datasets | Yes | We follow Devlin et al. [3] to use English Wikipedia corpus and Book Corpus for pretraining. By concatenating the two datasets, we obtain our corpus with roughly 2.8B word tokens in total, which is comparable with the data corpus used in Devlin et al. [3]. |
| Dataset Splits | Yes | We split documents into one training set and one validation set (300:1). For fine-tuning, we use GLUE (General Language Understanding Evaluation), a collection of 9 sentence or sentence-pair natural language understanding tasks... (an illustrative split sketch follows the table) |
| Hardware Specification | Yes | All experiments are performed on 4 DGX-2 boxes with 64 V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' and 'PyTorch DDP (Distributed Data Parallel) library' but does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a warm-up ratio of 0.02 with lr_max = 1e-4. Following [3], we use Adam as the optimizer. We train with batch size 4K for 200K steps, which is approximately 186 epochs. The detailed parameter settings are listed in Appendix A. (a hedged optimizer/scheduler sketch follows the table) |
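
The pseudocode row points at Algorithm 1 (Progressive_Layer_Dropping). As a rough illustration of the idea, here is a minimal PyTorch-style sketch: a global keep probability θ(t) decays from 1.0 toward a floor θ̄, and each block's keep probability shrinks linearly with depth, as in stochastic depth. The names (`keep_probability`, `ProgressiveDropEncoder`), the constants `gamma` and `theta_bar`, and the assumption that each block computes only the residual branch F(x) are illustrative assumptions, not the authors' released implementation.

```python
import math

import torch
import torch.nn as nn


def keep_probability(step: int, layer_idx: int, num_layers: int,
                     gamma: float = 1e-4, theta_bar: float = 0.5) -> float:
    """Keep probability for layer `layer_idx` (1-based) at training `step`.

    theta(t) decays from 1.0 toward theta_bar over time; deeper layers get a
    lower keep probability, as in stochastic depth. gamma and theta_bar are
    illustrative values, not the paper's tuned constants.
    """
    theta_t = (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar
    return 1.0 - (layer_idx / num_layers) * (1.0 - theta_t)


class ProgressiveDropEncoder(nn.Module):
    """Encoder whose residual blocks are stochastically skipped during training.

    Each element of `blocks` is assumed to compute only the residual branch
    F(x) (attention + FFN without the skip connection), so the update is
    x <- x + F(x) / p when the block is kept, and x <- x when it is dropped.
    """

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.step = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        num_layers = len(self.blocks)
        for i, block in enumerate(self.blocks, start=1):
            p = keep_probability(self.step, i, num_layers)
            if self.training:
                if torch.rand(()).item() < p:
                    x = x + block(x) / p      # keep the block, rescale its output
                # otherwise skip the block entirely (identity mapping)
            else:
                x = x + block(x)              # at inference every block is applied
        if self.training:
            self.step += 1
        return x
```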
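
For the 300:1 document-level train/validation split quoted above, a minimal sketch under assumed details (random per-document assignment with a fixed seed; `split_documents` is a hypothetical helper, not the authors' preprocessing code):

```python
import random


def split_documents(documents, valid_fraction=1 / 301, seed=0):
    """Assign each document to train or validation at roughly 300:1."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for doc in documents:
        (valid if rng.random() < valid_fraction else train).append(doc)
    return train, valid
```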
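
The experiment-setup row lists Adam with lr_max = 1e-4, a warm-up ratio of 0.02, batch size 4K, and 200K training steps. A hedged sketch of such an optimizer and learning-rate schedule is below; the linear decay after warm-up and the placeholder model are assumptions, since the paper defers the detailed settings to Appendix A.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 200_000                     # "200K steps"
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)    # warm-up ratio of 0.02 -> 4,000 steps
LR_MAX = 1e-4                             # lr_max from the quoted setup

model = torch.nn.Linear(768, 768)         # placeholder; the paper pre-trains BERT
optimizer = Adam(model.parameters(), lr=LR_MAX)


def lr_lambda(step: int) -> float:
    """Linear warm-up to LR_MAX, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))


scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop skeleton: step the scheduler once per optimizer step.
# for step in range(TOTAL_STEPS):
#     loss = ...                          # forward pass on a 4K-sample batch
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     scheduler.step()
```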