DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION
Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through a comprehensive empirical study that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). |
| Researcher Affiliation | Industry | 1 Microsoft Dynamics 365 AI, 2 Microsoft Research; {penhe,xiaodl,jfgao,wzchen}@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Disentangled Attention (a hedged sketch of this mechanism is given below the table) |
| Open Source Code | Yes | The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa. Our code and models are also available at Hugging Face Transformers: https://github.com/huggingface/transformers, https://huggingface.co/models?filter=deberta |
| Open Datasets | Yes | For training data, we use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB), and STORIES (a subset of Common Crawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78GB. Refer to Appendix A.2 for a detailed description of the pre-training dataset. |
| Dataset Splits | Yes | For pre-training, we also sample 5% of the training data as the validation set to monitor the training process (a minimal split sketch follows the table). Table 6: Summary information of the NLP application benchmarks. (lists #Dev for all tasks) |
| Hardware Specification | Yes | We use 6 DGX-2 machines (96 V100 GPUs) to train the models. |
| Software Dependencies | No | The paper mentions software such as Hugging Face Transformers, fairseq, and Megatron, and the Adam optimizer with weight decay, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Table 8: Hyper-parameters for pre-training DeBERTa. Table 9: Hyper-parameters for fine-tuning DeBERTa on downstream tasks. |
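
The Pseudocode row above points to Algorithm 1 (disentangled attention). Below is a minimal, single-head PyTorch sketch of that mechanism as described in the paper: attention scores are the sum of content-to-content, content-to-position, and position-to-content terms computed from separate content and relative-position projections, and the sum is scaled by sqrt(3d) before the softmax. The module and tensor names (`DisentangledSelfAttention`, `rel_embeddings`, etc.) are illustrative assumptions and do not match the released implementation.

```python
# Hedged sketch of DeBERTa-style disentangled attention (single head, no masking).
# Names and defaults are illustrative; see the official repo for the real code.
import torch
import torch.nn as nn


class DisentangledSelfAttention(nn.Module):
    def __init__(self, dim: int, max_relative_positions: int = 512):
        super().__init__()
        self.dim = dim
        self.k = max_relative_positions  # relative distances clamped into [0, 2k)
        # Content projections W_{q,c}, W_{k,c}, W_{v,c}
        self.q_c = nn.Linear(dim, dim)
        self.k_c = nn.Linear(dim, dim)
        self.v_c = nn.Linear(dim, dim)
        # Relative-position projections W_{q,r}, W_{k,r}
        self.q_r = nn.Linear(dim, dim)
        self.k_r = nn.Linear(dim, dim)
        # Shared relative position embedding table P of size 2k x dim
        self.rel_embeddings = nn.Embedding(2 * self.k, dim)

    def relative_positions(self, n: int, device) -> torch.Tensor:
        # delta(i, j) = 0 if i - j <= -k, 2k - 1 if i - j >= k, else i - j + k
        idx = torch.arange(n, device=device)
        delta = idx[:, None] - idx[None, :] + self.k
        return delta.clamp(0, 2 * self.k - 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim)
        b, n, _ = hidden.shape
        Qc, Kc, Vc = self.q_c(hidden), self.k_c(hidden), self.v_c(hidden)
        P = self.rel_embeddings.weight            # (2k, dim)
        Qr, Kr = self.q_r(P), self.k_r(P)         # (2k, dim)

        delta = self.relative_positions(n, hidden.device)  # (n, n) indices

        # (a) content-to-content: Qc_i . Kc_j
        c2c = torch.einsum("bid,bjd->bij", Qc, Kc)
        # (b) content-to-position: Qc_i . Kr_{delta(i, j)}
        c2p = torch.einsum("bid,ijd->bij", Qc, Kr[delta])
        # (c) position-to-content: Kc_j . Qr_{delta(j, i)}
        p2c = torch.einsum("bjd,ijd->bij", Kc, Qr[delta.transpose(0, 1)])

        # Three score terms are summed, so scale by sqrt(3d) instead of sqrt(d).
        scores = (c2c + c2p + p2c) / (3 * self.dim) ** 0.5
        return torch.softmax(scores, dim=-1) @ Vc


# Usage: a random batch of 2 sequences, length 16, hidden size 64.
attn = DisentangledSelfAttention(dim=64)
out = attn(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```

The sqrt(3d) denominator reflects the paper's observation that three score terms are summed, so the usual sqrt(d) attention scaling is adjusted accordingly.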
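For the Dataset Splits row, the paper only states that 5% of the pre-training data is held out as a validation set to monitor training. A minimal sketch of such a split is below; the 5% fraction comes from the paper, while the function name, seed, and document-level shuffling are assumptions made for illustration.

```python
import random


def split_pretraining_corpus(documents, valid_fraction=0.05, seed=0):
    """Hold out a fraction of training documents as a validation set
    used only to monitor the pre-training loss (fraction per the paper)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # illustrative: the paper does not specify how samples are drawn
    n_valid = int(len(docs) * valid_fraction)
    return docs[n_valid:], docs[:n_valid]  # (train, validation)


train_docs, valid_docs = split_pretraining_corpus([f"doc_{i}" for i in range(1000)])
```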