DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Authors: Pengcheng He, Jianfeng Gao, Weizhu Chen

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks.
Researcher Affiliation | Industry | Pengcheng He (Microsoft Azure AI), Jianfeng Gao (Microsoft Research), Weizhu Chen (Microsoft Azure AI); {penhe,jfgao,wzchen}@microsoft.com
Pseudocode | No | The paper describes the implementation steps for GDES in prose within Section 3.3, but it does not present these steps in a formalized pseudocode block or algorithm box. (A hedged sketch of GDES is given after this table.)
Open Source Code | Yes | Our models and code are publicly available at https://github.com/microsoft/DeBERTa.
Open Datasets | Yes | In this implementation, Wikipedia and the bookcorpus (Zhu et al., 2015) are used as training data, following the base model configuration of Devlin et al. (2019). The multi-lingual version of DeBERTaV3 is trained with 2.5T CC100 data, which is the same as XLM-R.
Dataset Splits | Yes | For pre-training, we also sample 5% of the training data as the validation set to monitor the training process.
Hardware Specification | No | The paper mentions that fine-tuning runs took 'about 1-2 hours on a DGX-2 node', but it does not specify the types or quantities of GPUs, CPUs, or other hardware components within that node.
Software Dependencies | No | The paper mentions using the 'AdamW' and 'Adam' optimizers and states that its 'code is implemented based on DeBERTa (He et al., 2020) and ELECTRA (Clark et al., 2020)', but it does not provide specific version numbers for these software libraries, programming languages (e.g., Python), or underlying frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | The batch size is set to 2048, and the model is trained for 125,000 steps with a learning rate of 5e-4 and warmup steps of 10,000. Following Clark et al. (2020), we use λ = 50 with the same optimization hyperparameters. All the models are trained for 500,000 steps with a batch size of 8192 and warmup steps of 10,000. The learning rate for the base and small models is 5e-4, while the learning rate for the large model is 3e-4. Following the DeBERTa setting, we use the AdamW (Loshchilov & Hutter, 2018) optimizer, which is a fixed version of Adam (Kingma & Ba, 2014) with weight decay, and set β1 = 0.9, β2 = 0.98 for the optimizer. We provide more details on the hyperparameters of pre-training and fine-tuning in the Appendix. (A hedged configuration sketch collecting these numbers is given after this table.)
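
Since the paper describes GDES only in prose (see the Pseudocode row above), the following is a minimal PyTorch-style sketch of gradient-disentangled embedding sharing as summarized in Section 3.3: the generator and discriminator share a token-embedding table, but the discriminator sees a stop-gradient copy of that table plus a residual "delta" embedding, so the discriminator's RTD loss never updates the shared table. The module and attribute names (GDESEmbedding, E_shared, E_delta) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of Gradient-Disentangled Embedding Sharing (GDES).
# Names and structure are illustrative; see https://github.com/microsoft/DeBERTa
# for the authors' actual implementation.
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        # Shared table: updated only by the generator's MLM loss.
        self.E_shared = nn.Embedding(vocab_size, hidden_size)
        # Residual "delta" table: initialized to zero, updated only by the
        # discriminator's RTD loss.
        self.E_delta = nn.Embedding(vocab_size, hidden_size)
        nn.init.zeros_(self.E_delta.weight)

    def generator_embed(self, input_ids: torch.Tensor) -> torch.Tensor:
        # The generator uses the shared embeddings directly, so MLM gradients
        # flow into E_shared.
        return self.E_shared(input_ids)

    def discriminator_embed(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Stop-gradient on the shared lookup so RTD gradients cannot flow back
        # into E_shared ("gradient disentanglement"); only E_delta is updated.
        return self.E_shared(input_ids).detach() + self.E_delta(input_ids)
```

In the paper, the delta embeddings are added back into the shared table after pre-training to form the discriminator's final word embeddings; the sketch above only covers the training-time forward pass.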
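
Similarly, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. Only the numeric values come from the paper; the dictionary layout, the names (ABLATION_RUN, FULL_PRETRAIN, RTD_LOSS_WEIGHT, make_optimizer), and the optimizer helper are assumptions, and the weight-decay value is not quoted in this section.

```python
# Hypothetical configuration assembled from the hyperparameters quoted above.
# Only the numbers come from the paper; names and structure are illustrative.
from torch.optim import AdamW

# Smaller run: batch size 2048, 125,000 steps, lr 5e-4, 10,000 warmup steps.
ABLATION_RUN = {"batch_size": 2048, "steps": 125_000, "lr": 5e-4, "warmup_steps": 10_000}

# Full pre-training: 500,000 steps, batch size 8192, 10,000 warmup steps;
# lr 5e-4 for small/base models and 3e-4 for the large model.
FULL_PRETRAIN = {
    "batch_size": 8192,
    "steps": 500_000,
    "warmup_steps": 10_000,
    "lr": {"small": 5e-4, "base": 5e-4, "large": 3e-4},
}

# λ = 50 weights the RTD loss against the generator's MLM loss,
# following Clark et al. (2020): L = L_MLM + λ * L_RTD.
RTD_LOSS_WEIGHT = 50.0

def make_optimizer(model, size="base"):
    # AdamW with β1 = 0.9, β2 = 0.98 as reported; the weight-decay value is
    # not quoted in this section, so PyTorch's default is left in place.
    return AdamW(model.parameters(), lr=FULL_PRETRAIN["lr"][size], betas=(0.9, 0.98))
```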