DynaBERT: Dynamic BERT with Adaptive Width and Depth
Authors: Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance as BERT_BASE (or RoBERTa_BASE), while at smaller widths and depths consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT. |
| Researcher Affiliation | Collaboration | Lu Hou1, Zhiqi Huang2, Lifeng Shang1, Xin Jiang1, Xiao Chen1, Qun Liu1. 1Huawei Noah's Ark Lab {houlu3,shang.lifeng,Jiang.Xin,chen.xiao2,qun.liu}@huawei.com; 2Peking University, China, zhiqihuang@pku.edu.cn |
| Pseudocode | Yes | Algorithm 1: Train DynaBERT_W or DynaBERT. (A hedged sketch of this training loop appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT. |
| Open Datasets | Yes | In this section, we evaluate the efficacy of the proposed DynaBERT on the General Language Understanding Evaluation (GLUE) tasks [26] and the machine reading comprehension task SQuAD v1.1 [19], using both BERT_BASE [5] and RoBERTa_BASE [14] as the backbone models. ... The GLUE benchmark [26] is a collection of diverse natural language understanding tasks. ... SQuAD v1.1 (Stanford Question Answering Dataset) [19] contains 100k crowd-sourced question/answer pairs. |
| Dataset Splits | Yes | Following [5], for the development set, we report Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the other tasks. ... Empirically, we use the development set to calculate the importance of attention heads and neurons. (A metric-selection sketch appears after the table.) |
| Hardware Specification | Yes | We use Nvidia V100 GPU for training. ... We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, and the latency on Nvidia K40 GPU and Kirin 810 A76 ARM CPU (details can be found in Appendix B.3). |
| Software Dependencies | No | The paper mentions models and frameworks used (e.g., BERT, RoBERTa, DistilBERT, TinyBERT) and implicitly references their underlying software, but it does not specify concrete ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Detailed hyperparameters for the experiments are in Appendix B.2. ... The GPU latency is the running time of 100 batches with batch size 128 and sequence length 128. The CPU latency is tested with batch size 1 and sequence length 128. (A latency-timing sketch appears after the table.) |
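Algorithm 1 in the paper trains DynaBERT_W (adaptive width) and then DynaBERT (adaptive width and depth) by distilling a fixed teacher into every sub-network configuration. Below is a minimal PyTorch-style sketch of that loop, not the authors' implementation: the `set_width_mult`/`set_depth_mult` hooks are hypothetical placeholders for however the real code masks the student down to a sub-network, and only logit distillation is shown (the paper also distills embeddings and hidden states).

```python
import torch
import torch.nn.functional as F

# Multipliers matching those reported in the paper's experiments.
WIDTH_MULTS = [1.0, 0.75, 0.5, 0.25]
DEPTH_MULTS = [1.0, 0.75, 0.5]

def train_step(teacher, student, batch, optimizer, train_depth=False):
    """One distillation step over all sub-network configurations.

    `set_width_mult` / `set_depth_mult` are assumed hooks, not a published
    API: they stand in for masking out the least important attention
    heads/neurons (width) and dropping layers (depth).
    """
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(**batch)          # fixed teacher predictions

    optimizer.zero_grad()
    depth_mults = DEPTH_MULTS if train_depth else [1.0]  # DynaBERT_W: width only
    for w in WIDTH_MULTS:
        for d in depth_mults:
            student.set_width_mult(w)        # keep the most important heads/neurons
            student.set_depth_mult(d)        # drop layers when training DynaBERT
            s_logits = student(**batch)
            loss = F.kl_div(                 # soft-label distillation loss
                F.log_softmax(s_logits, dim=-1),
                F.softmax(t_logits, dim=-1),
                reduction="batchmean",
            )
            loss.backward()                  # accumulate gradients across configs
    optimizer.step()
```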
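The dev-set metrics quoted in the Dataset Splits row (Spearman for STS-B, Matthews for CoLA, accuracy elsewhere) are the standard GLUE choices following Devlin et al. [5]. A minimal sketch using scipy and scikit-learn, with lowercase task names as an assumed convention:

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, matthews_corrcoef

def glue_dev_metric(task, preds, labels):
    """Per-task GLUE development-set metric as described in the paper."""
    if task == "sts-b":                       # regression: Spearman correlation
        return spearmanr(preds, labels).correlation
    if task == "cola":                        # Matthews correlation coefficient
        return matthews_corrcoef(labels, preds)
    return accuracy_score(labels, preds)      # accuracy for the remaining tasks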
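The latency protocol in the Experiment Setup row (100 batches at batch size 128, sequence length 128 on GPU; batch size 1 on CPU) can be approximated as follows. This is a sketch under assumptions: `model` is any PyTorch module taking token ids, the vocabulary size and warm-up iterations are illustrative, and CUDA synchronization is added so the timing covers the kernels rather than just their launch.

```python
import time
import torch

def measure_latency(model, device, batch_size, seq_len=128, n_batches=100):
    """Time `n_batches` forward passes, mirroring the paper's protocol."""
    model.to(device).eval()
    # Random token ids as dummy input; 30522 is BERT's vocab size.
    dummy = torch.randint(0, 30522, (batch_size, seq_len), device=device)
    with torch.no_grad():
        for _ in range(10):                   # warm-up (not stated in the paper)
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_batches):
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return time.perf_counter() - start

# GPU latency: 100 batches, batch size 128, sequence length 128.
# gpu_seconds = measure_latency(model, torch.device("cuda"), batch_size=128)
# CPU latency: batch size 1, sequence length 128.
# cpu_seconds = measure_latency(model, torch.device("cpu"), batch_size=1)
```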