DynaBERT: Dynamic BERT with Adaptive Width and Depth

Authors: Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance as BERT_BASE (or RoBERTa_BASE), while at smaller widths and depths consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
Researcher Affiliation | Collaboration | Lu Hou (1), Zhiqi Huang (2), Lifeng Shang (1), Xin Jiang (1), Xiao Chen (1), Qun Liu (1). (1) Huawei Noah's Ark Lab, {houlu3,shang.lifeng,Jiang.Xin,chen.xiao2,qun.liu}@huawei.com; (2) Peking University, China, zhiqihuang@pku.edu.cn
Pseudocode | Yes | Algorithm 1: Train DynaBERT_W or DynaBERT. (An illustrative training-loop sketch follows the table.)
Open Source Code | Yes | Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
Open Datasets | Yes | In this section, we evaluate the efficacy of the proposed DynaBERT on the General Language Understanding Evaluation (GLUE) tasks [26] and the machine reading comprehension task SQuAD v1.1 [19], using both BERT_BASE [5] and RoBERTa_BASE [14] as the backbone models. ... The GLUE benchmark [26] is a collection of diverse natural language understanding tasks. ... SQuAD v1.1 (Stanford Question Answering Dataset) [19] contains 100k crowd-sourced question/answer pairs. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | Following [5], for the development set, we report Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the other tasks. ... Empirically, we use the development set to calculate the importance of attention heads and neurons. (A metric-computation sketch follows the table.)
Hardware Specification | Yes | We use Nvidia V100 GPU for training. ... We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on Nvidia K40 GPU and Kirin 810 A76 ARM CPU (details can be found in Appendix B.3).
Software Dependencies | No | The paper mentions models and frameworks used (e.g., BERT, RoBERTa, DistilBERT, TinyBERT) and implicitly references their underlying software, but it does not specify concrete ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Detailed hyperparameters for the experiments are in Appendix B.2. ... The GPU latency is the running time of 100 batches with batch size 128 and sequence length 128. The CPU latency is tested with batch size 1 and sequence length 128. (A latency-measurement sketch follows the table.)
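
The pseudocode row refers to Algorithm 1, which jointly trains sub-networks at several width multipliers by distilling from a fixed teacher. The snippet below is a minimal sketch of that width-adaptive stage under our own assumptions: a PyTorch model whose forward() accepts a hypothetical width_mult keyword, and a loss reduced to logit distillation plus a hard-label cross-entropy term (the paper's actual objective also distills embeddings and hidden states, and is preceded by a rewiring step that reorders attention heads and neurons by importance). Names such as model, teacher, and loader are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

WIDTH_MULTS = [1.0, 0.75, 0.5, 0.25]  # candidate width multipliers from the paper


def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between student and fixed-teacher predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2


def train_dynabert_w(model, teacher, loader, optimizer, device="cuda"):
    """One epoch of width-adaptive training (sketch of Algorithm 1's first stage)."""
    teacher.eval()
    model.train()
    for batch in loader:
        inputs = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        optimizer.zero_grad()
        # Accumulate gradients over every candidate width before a single update,
        # so all sub-networks of DynaBERT_W are trained jointly.
        for width_mult in WIDTH_MULTS:
            student_logits = model(inputs, width_mult=width_mult)  # hypothetical keyword
            loss = distill_loss(student_logits, teacher_logits)
            loss = loss + F.cross_entropy(student_logits, labels)
            loss.backward()
        optimizer.step()
```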
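
The datasets quoted in the open-datasets row (GLUE and SQuAD v1.1) are public benchmarks. The paper does not say which loader was used, so the lines below are only a convenient, hypothetical way to fetch them with the Hugging Face datasets library; the GLUE task list is illustrative rather than exhaustive.

```python
from datasets import load_dataset

# A subset of GLUE tasks, for illustration.
glue_tasks = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]
glue = {task: load_dataset("glue", task) for task in glue_tasks}

# SQuAD v1.1: ~100k crowd-sourced question/answer pairs.
squad = load_dataset("squad")

print(glue["cola"])                    # DatasetDict with train / validation / test splits
print(squad["train"][0]["question"])   # one example question
```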
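
The dataset-splits row lists the development-set metrics: Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the remaining tasks. A small helper using scipy and scikit-learn could compute them as sketched below; the function and task names are our own.

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, matthews_corrcoef


def glue_dev_metric(task, predictions, references):
    """Development-set metric per GLUE task, as reported in the paper."""
    if task == "stsb":   # regression task: Spearman correlation
        return spearmanr(predictions, references)[0]
    if task == "cola":   # Matthews correlation coefficient
        return matthews_corrcoef(references, predictions)
    return accuracy_score(references, predictions)


# Example usage with dummy labels:
print(glue_dev_metric("cola", [1, 0, 1, 1], [1, 0, 0, 1]))
```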
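
The hardware and experiment-setup rows describe the latency protocol: the GPU latency is the running time of 100 batches with batch size 128 and sequence length 128, and the CPU latency uses batch size 1 with sequence length 128. The sketch below reproduces that timing loop with a stock bert-base-uncased checkpoint from transformers as a stand-in for DynaBERT; the paper's numbers were measured on an Nvidia K40 GPU and a Kirin 810 A76 ARM CPU, so absolute values will differ.

```python
import time
import torch
from transformers import BertModel


def measure_latency(device="cuda", batch_size=128, seq_len=128, n_batches=100):
    """Total wall-clock time (seconds) of n_batches forward passes on random input."""
    model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
    input_ids = torch.randint(0, model.config.vocab_size,
                              (batch_size, seq_len), device=device)
    with torch.no_grad():
        model(input_ids)              # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_batches):
            model(input_ids)
        if device == "cuda":
            torch.cuda.synchronize()
    return time.time() - start


# GPU setting from the paper: 100 batches, batch size 128, sequence length 128.
# print(measure_latency("cuda", batch_size=128, seq_len=128, n_batches=100))
# CPU setting from the paper: batch size 1, sequence length 128.
# print(measure_latency("cpu", batch_size=1, seq_len=128, n_batches=100))
```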