DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION

Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show through a comprehensive empirical study that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%)."
Researcher Affiliation | Industry | "Microsoft Dynamics 365 AI, Microsoft Research; {penhe,xiaodl,jfgao,wzchen}@microsoft.com"
Pseudocode | Yes | "Algorithm 1 Disentangled Attention" (a hedged sketch of the attention computation is given after this table)
Open Source Code | Yes | "The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa. Our code and models are also available at Hugging Face Transformers: https://github.com/huggingface/transformers, https://huggingface.co/models?filter=deberta" (a minimal loading example follows the table)
Open Datasets | Yes | "For training data, we use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB), and STORIES (a subset of Common Crawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78GB. Refer to Appendix A.2 for a detailed description of the pre-training dataset."
Dataset Splits | Yes | "For pre-training, we also sample 5% training data as the validation set to monitor the training process." Table 6 (Summary information of the NLP application benchmarks) lists #Dev sizes for all tasks. (a sketch of the 5% hold-out appears after this table)
Hardware Specification | Yes | "We use 6 DGX-2 machines (96 V100 GPUs) to train the models."
Software Dependencies | No | The paper mentions software such as Hugging Face Transformers, FairSeq, and Megatron, as well as the Adam optimizer with weight decay, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "Table 8: Hyper-parameters for pre-training DeBERTa." "Table 9: Hyper-parameters for fine-tuning DeBERTa on downstream tasks." (a hedged fine-tuning skeleton follows the table)
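
The Pseudocode row points to Algorithm 1 (disentangled attention). As a reading aid, here is a minimal single-head sketch of the content-to-content, content-to-position and position-to-content score terms; the function name, the plain matrix parameters, and the pre-computed relative-distance index rel_idx are illustrative assumptions, not the released implementation.

```python
# Minimal single-head sketch of disentangled attention (Algorithm 1).
# Assumptions (not the released code): weights are plain (d, d) matrices,
# rel_idx is a pre-bucketed LongTensor of relative distances delta(i, j)
# with values in [0, 2k), and there is no masking, dropout, or multi-head reshaping.
import math
import torch

def disentangled_attention(hidden, rel_embed, Wq_c, Wk_c, Wv, Wq_r, Wk_r, rel_idx):
    # hidden:    (N, d)   content states of N tokens
    # rel_embed: (2k, d)  relative-position embeddings
    # rel_idx:   (N, N)   rel_idx[i, j] = bucketed delta(i, j)
    Qc, Kc, V = hidden @ Wq_c, hidden @ Wk_c, hidden @ Wv   # content projections
    Qr, Kr = rel_embed @ Wq_r, rel_embed @ Wk_r             # position projections

    c2c = Qc @ Kc.T                                         # content-to-content
    c2p = torch.gather(Qc @ Kr.T, -1, rel_idx)              # Qc_i . Kr_{delta(i,j)}
    p2c = torch.gather(Kc @ Qr.T, -1, rel_idx).T            # Kc_j . Qr_{delta(j,i)}

    scores = (c2c + c2p + p2c) / math.sqrt(3 * hidden.size(-1))  # scale by sqrt(3d)
    return torch.softmax(scores, dim=-1) @ V
```

The scaling factor sqrt(3d) (rather than the usual sqrt(d)) reflects the three score terms being summed, as described in the paper.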
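For the Open Source Code row, the released checkpoints can be exercised through the public Hugging Face Transformers API; "microsoft/deberta-base" used below is one of the checkpoints listed under the deberta filter on the model hub.

```python
# Load a released DeBERTa checkpoint via Hugging Face Transformers.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa uses disentangled attention.", return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state   # (batch, seq_len, hidden_size)
print(hidden_states.shape)
```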
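The 5% validation hold-out mentioned under Dataset Splits can be reproduced generically as sketched below; the paper does not specify how the sample was drawn, so random_split and the fixed seed are assumptions for illustration.

```python
# Illustrative 5% hold-out for monitoring pre-training (the exact sampling
# procedure is not specified in the paper; random_split and the seed are
# assumptions).
import torch
from torch.utils.data import random_split

def split_pretraining_corpus(dataset, val_fraction=0.05, seed=0):
    n_val = int(len(dataset) * val_fraction)        # 5% of examples for validation
    n_train = len(dataset) - n_val
    gen = torch.Generator().manual_seed(seed)       # reproducible split
    return random_split(dataset, [n_train, n_val], generator=gen)
```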
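Finally, for the Experiment Setup row, a hedged fine-tuning skeleton for a three-way NLU task such as MNLI; the learning rate and weight decay below are placeholders, not the values reported in Table 9.

```python
# Hedged fine-tuning skeleton for an MNLI-style task. Hyper-parameter values
# are illustrative placeholders, not those reported in Table 9.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-base", num_labels=3)          # MNLI has 3 labels

# Adam with weight decay, as mentioned in the paper (values are placeholders).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

batch = tokenizer(["A premise sentence."], ["A hypothesis sentence."],
                  return_tensors="pt", padding=True)
batch["labels"] = torch.tensor([0])

model.train()
loss = model(**batch).loss   # cross-entropy over the 3 classes
loss.backward()
optimizer.step()
```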