DynaBERT: Dynamic BERT with Adaptive Width and Depth

Authors: Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has comparable performance as BERT_BASE (or RoBERTa_BASE), while at smaller widths and depths consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
Researcher Affiliation | Collaboration | Lu Hou (1), Zhiqi Huang (2), Lifeng Shang (1), Xin Jiang (1), Xiao Chen (1), Qun Liu (1). (1) Huawei Noah's Ark Lab, {houlu3,shang.lifeng,Jiang.Xin,chen.xiao2,qun.liu}@huawei.com; (2) Peking University, China, zhiqihuang@pku.edu.cn
Pseudocode | Yes | Algorithm 1: Train DynaBERT_W or DynaBERT. (An illustrative training-loop sketch follows the table.)
Open Source Code | Yes | Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
Open Datasets | Yes | In this section, we evaluate the efficacy of the proposed DynaBERT on the General Language Understanding Evaluation (GLUE) tasks [26] and the machine reading comprehension task SQuAD v1.1 [19], using both BERT_BASE [5] and RoBERTa_BASE [14] as the backbone models. ... The GLUE benchmark [26] is a collection of diverse natural language understanding tasks. ... SQuAD v1.1 (Stanford Question Answering Dataset) [19] contains 100k crowd-sourced question/answer pairs. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | Following [5], for the development set, we report Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the other tasks. ... Empirically, we use the development set to calculate the importance of attention heads and neurons. (A metric-computation sketch follows the table.)
Hardware Specification | Yes | We use Nvidia V100 GPU for training. ... We evaluate the efficacy of our proposed DynaBERT and DynaRoBERTa under different efficiency constraints, including #parameters, FLOPs, the latency on Nvidia K40 GPU and Kirin 810 A76 ARM CPU (details can be found in Appendix B.3).
Software Dependencies | No | The paper mentions models and frameworks used (e.g., BERT, RoBERTa, DistilBERT, TinyBERT) and implicitly references their underlying software, but it does not specify concrete ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Detailed hyperparameters for the experiments are in Appendix B.2. ... The GPU latency is the running time of 100 batches with batch size 128 and sequence length 128. The CPU latency is tested with batch size 1 and sequence length 128. (A latency-measurement sketch follows the table.)
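
The pseudocode row refers to Algorithm 1, which jointly trains sub-networks at several width multipliers by distilling from a fixed teacher. The snippet below is a minimal sketch of that width-adaptive stage under our own assumptions: a PyTorch model whose forward() accepts a hypothetical width_mult keyword, and a loss reduced to logit distillation plus a hard-label cross-entropy term (the paper's actual objective also distills embeddings and hidden states, and is preceded by a rewiring step that reorders attention heads and neurons by importance). Names such as model, teacher, and loader are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

WIDTH_MULTS = [1.0, 0.75, 0.5, 0.25]  # candidate width multipliers from the paper


def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between student and fixed-teacher predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2


def train_dynabert_w(model, teacher, loader, optimizer, device="cuda"):
    """One epoch of width-adaptive training (sketch of Algorithm 1's first stage)."""
    teacher.eval()
    model.train()
    for batch in loader:
        inputs = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        with torch.no_grad():
            teacher_logits = teacher(inputs)
        optimizer.zero_grad()
        # Accumulate gradients over every candidate width before a single update,
        # so all sub-networks of DynaBERT_W are trained jointly.
        for width_mult in WIDTH_MULTS:
            student_logits = model(inputs, width_mult=width_mult)  # hypothetical keyword
            loss = distill_loss(student_logits, teacher_logits)
            loss = loss + F.cross_entropy(student_logits, labels)
            loss.backward()
        optimizer.step()
```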
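
The datasets quoted in the open-datasets row (GLUE and SQuAD v1.1) are public benchmarks. The paper does not say which loader was used, so the lines below are only a convenient, hypothetical way to fetch them with the Hugging Face datasets library; the GLUE task list is illustrative rather than exhaustive.

```python
from datasets import load_dataset

# A subset of GLUE tasks, for illustration.
glue_tasks = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte"]
glue = {task: load_dataset("glue", task) for task in glue_tasks}

# SQuAD v1.1: ~100k crowd-sourced question/answer pairs.
squad = load_dataset("squad")

print(glue["cola"])                    # DatasetDict with train / validation / test splits
print(squad["train"][0]["question"])   # one example question
```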
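
The dataset-splits row lists the development-set metrics: Spearman correlation for STS-B, Matthews correlation for CoLA, and accuracy for the remaining tasks. A small helper using scipy and scikit-learn could compute them as sketched below; the function and task names are our own.

```python
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, matthews_corrcoef


def glue_dev_metric(task, predictions, references):
    """Development-set metric per GLUE task, as reported in the paper."""
    if task == "stsb":   # regression task: Spearman correlation
        return spearmanr(predictions, references)[0]
    if task == "cola":   # Matthews correlation coefficient
        return matthews_corrcoef(references, predictions)
    return accuracy_score(references, predictions)


# Example usage with dummy labels:
print(glue_dev_metric("cola", [1, 0, 1, 1], [1, 0, 0, 1]))
```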
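
The hardware and experiment-setup rows describe the latency protocol: the GPU latency is the running time of 100 batches with batch size 128 and sequence length 128, and the CPU latency uses batch size 1 with sequence length 128. The sketch below reproduces that timing loop with a stock bert-base-uncased checkpoint from transformers as a stand-in for DynaBERT; the paper's numbers were measured on an Nvidia K40 GPU and a Kirin 810 A76 ARM CPU, so absolute values will differ.

```python
import time
import torch
from transformers import BertModel


def measure_latency(device="cuda", batch_size=128, seq_len=128, n_batches=100):
    """Total wall-clock time (seconds) of n_batches forward passes on random input."""
    model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
    input_ids = torch.randint(0, model.config.vocab_size,
                              (batch_size, seq_len), device=device)
    with torch.no_grad():
        model(input_ids)              # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_batches):
            model(input_ids)
        if device == "cuda":
            torch.cuda.synchronize()
    return time.time() - start


# GPU setting from the paper: 100 batches, batch size 128, sequence length 128.
# print(measure_latency("cuda", batch_size=128, seq_len=128, n_batches=100))
# CPU setting from the paper: batch size 1, sequence length 128.
# print(measure_latency("cpu", batch_size=1, seq_len=128, n_batches=100))
```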