DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION
Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show through a comprehensive empirical study that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). |
| Researcher Affiliation | Industry | 1 Microsoft Dynamics 365 AI, 2 Microsoft Research; {penhe,xiaodl,jfgao,wzchen}@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Disentangled Attention (a hedged sketch of this mechanism is given below the table) |
| Open Source Code | Yes | The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa. Our code and models are also available at Hugging Face Transformers: https://github.com/huggingface/transformers, https://huggingface.co/models?filter=deberta |
| Open Datasets | Yes | For training data, we use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB), and STORIES (a subset of Common Crawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78GB. Refer to Appendix A.2 for a detailed description of the pre-training dataset. |
| Dataset Splits | Yes | For pre-training, we also sample 5% of the training data as the validation set to monitor the training process (a minimal split sketch follows the table). Table 6: Summary information of the NLP application benchmarks. (lists #Dev for all tasks) |
| Hardware Specification | Yes | We use 6 DGX-2 machines (96 V100 GPUs) to train the models. |
| Software Dependencies | No | The paper mentions software such as Hugging Face Transformers, fairseq, and Megatron, and the Adam optimizer with weight decay, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Table 8: Hyper-parameters for pre-training DeBERTa. Table 9: Hyper-parameters for fine-tuning DeBERTa on downstream tasks. |
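
The Pseudocode row above points to Algorithm 1 (disentangled attention). Below is a minimal, single-head PyTorch sketch of that mechanism as described in the paper: attention scores are the sum of content-to-content, content-to-position, and position-to-content terms computed from separate content and relative-position projections, and the sum is scaled by sqrt(3d) before the softmax. The module and tensor names (`DisentangledSelfAttention`, `rel_embeddings`, etc.) are illustrative assumptions and do not match the released implementation.

```python
# Hedged sketch of DeBERTa-style disentangled attention (single head, no masking).
# Names and defaults are illustrative; see the official repo for the real code.
import torch
import torch.nn as nn


class DisentangledSelfAttention(nn.Module):
    def __init__(self, dim: int, max_relative_positions: int = 512):
        super().__init__()
        self.dim = dim
        self.k = max_relative_positions  # relative distances clamped into [0, 2k)
        # Content projections W_{q,c}, W_{k,c}, W_{v,c}
        self.q_c = nn.Linear(dim, dim)
        self.k_c = nn.Linear(dim, dim)
        self.v_c = nn.Linear(dim, dim)
        # Relative-position projections W_{q,r}, W_{k,r}
        self.q_r = nn.Linear(dim, dim)
        self.k_r = nn.Linear(dim, dim)
        # Shared relative position embedding table P of size 2k x dim
        self.rel_embeddings = nn.Embedding(2 * self.k, dim)

    def relative_positions(self, n: int, device) -> torch.Tensor:
        # delta(i, j) = 0 if i - j <= -k, 2k - 1 if i - j >= k, else i - j + k
        idx = torch.arange(n, device=device)
        delta = idx[:, None] - idx[None, :] + self.k
        return delta.clamp(0, 2 * self.k - 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim)
        b, n, _ = hidden.shape
        Qc, Kc, Vc = self.q_c(hidden), self.k_c(hidden), self.v_c(hidden)
        P = self.rel_embeddings.weight            # (2k, dim)
        Qr, Kr = self.q_r(P), self.k_r(P)         # (2k, dim)

        delta = self.relative_positions(n, hidden.device)  # (n, n) indices

        # (a) content-to-content: Qc_i . Kc_j
        c2c = torch.einsum("bid,bjd->bij", Qc, Kc)
        # (b) content-to-position: Qc_i . Kr_{delta(i, j)}
        c2p = torch.einsum("bid,ijd->bij", Qc, Kr[delta])
        # (c) position-to-content: Kc_j . Qr_{delta(j, i)}
        p2c = torch.einsum("bjd,ijd->bij", Kc, Qr[delta.transpose(0, 1)])

        # Three score terms are summed, so scale by sqrt(3d) instead of sqrt(d).
        scores = (c2c + c2p + p2c) / (3 * self.dim) ** 0.5
        return torch.softmax(scores, dim=-1) @ Vc


# Usage: a random batch of 2 sequences, length 16, hidden size 64.
attn = DisentangledSelfAttention(dim=64)
out = attn(torch.randn(2, 16, 64))  # -> shape (2, 16, 64)
```

The sqrt(3d) denominator reflects the paper's observation that three score terms are summed, so the usual sqrt(d) attention scaling is adjusted accordingly.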
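For the Dataset Splits row, the paper only states that 5% of the pre-training data is held out as a validation set to monitor training. A minimal sketch of such a split is below; the 5% fraction comes from the paper, while the function name, seed, and document-level shuffling are assumptions made for illustration.

```python
import random


def split_pretraining_corpus(documents, valid_fraction=0.05, seed=0):
    """Hold out a fraction of training documents as a validation set
    used only to monitor the pre-training loss (fraction per the paper)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # illustrative: the paper does not specify how samples are drawn
    n_valid = int(len(docs) * valid_fraction)
    return docs[n_valid:], docs[:n_valid]  # (train, validation)


train_docs, valid_docs = split_pretraining_corpus([f"doc_{i}" for i in range(1000)])
```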