AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning

Authors: Tao Yang, Jinghao Deng, Xiaojun Quan, Qifan Wang, Shaoliang Nie

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.
Researcher Affiliation | Collaboration | 1. School of Computer Science and Engineering, Sun Yat-sen University; 2. Meta AI
Pseudocode | Yes | Algorithm 1: Cross-tuning (a hedged sketch of this loop appears after the table)
Open Source Code | Yes | Our code is available at https://github.com/TaoYang225/AD-DROP.
Open Datasets | Yes | We conduct our main experiments on eight tasks of the GLUE benchmark [31], including SST-2 [38], MNLI [39], QNLI [40], QQP [41], CoLA [42], STS-B [43], MRPC [37], and RTE [44]. ... we conduct experiments on Named Entity Recognition (CoNLL-2003 [32]) and Machine Translation (WMT 2016 [33]) datasets ... Besides, we also evaluate AD-DROP on two out-of-distribution (OOD) datasets, including HANS [34] and PAWS-X [35].
Dataset Splits | Yes | After each epoch of training, we evaluate the model on the development set. Two baseline dropping strategies (i.e., dropping by random sampling and without dropping any position) are employed for comparison. We plot the loss curves of the model with these dropping strategies on both training and development sets in Figure 2.
Hardware Specification | Yes | We train the selected PrLMs on GeForce RTX 3090 GPUs.
Software Dependencies | No | We implement our AD-DROP in PyTorch with the Transformers package [47].
Experiment Setup | Yes | We tune the learning rate in {1e-5, 2e-5, 3e-5} and the batch size in {16, 32, 64}. ... The two critical hyperparameters p and q are searched within [0.1, 0.9] with step size 0.1. For integrated gradient in Eq. (3), we follow Hao et al. [23] and set m to 20. (An integrated-gradients sketch for this Eq. (3) step follows the cross-tuning sketch below.)
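The Pseudocode row above refers to the paper's Algorithm 1, cross-tuning, which alternates plain fine-tuning epochs with epochs that drop high-attribution self-attention positions. Below is a minimal PyTorch sketch of that idea. It is illustrative only: `ad_drop_mask`, `cross_tune`, and the `apply_ad_drop` keyword are assumed names rather than the authors' API, and the one-step `attn * grad` attribution stands in for the paper's integrated-gradients computation.

```python
import torch

def ad_drop_mask(attn: torch.Tensor, grad: torch.Tensor,
                 p: float = 0.3, q: float = 0.3) -> torch.Tensor:
    """Build a 0/1 mask over self-attention positions.

    attn, grad: (batch, heads, seq, seq) attention weights and their
    gradients w.r.t. the task loss. attn * grad is a one-step stand-in
    for the paper's integrated-gradients attribution.
    """
    attribution = attn * grad
    k = max(1, int(p * attn.size(-1)))          # candidate pool per attention row
    _, idx = attribution.topk(k, dim=-1)        # highest-attribution positions
    drop = torch.rand(idx.shape, device=attn.device) < q
    mask = torch.ones_like(attn)
    mask.scatter_(-1, idx, (~drop).float())     # 0 marks a dropped position
    return mask

def cross_tune(model, loader, optimizer, epochs):
    """Algorithm-1-style cross-tuning: alternate a vanilla fine-tuning
    epoch with an AD-DROP epoch. `apply_ad_drop` is a hypothetical flag
    that real code would wire into the self-attention layers."""
    for epoch in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch, apply_ad_drop=(epoch % 2 == 1)).loss
            loss.backward()
            optimizer.step()
```

This mirrors the roles of the two searched hyperparameters: p bounds the per-row pool of high-attribution candidate positions, and q sets the fraction of those candidates actually dropped, which is what the topk-then-Bernoulli step above mimics.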
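The m = 20 setting in the Experiment Setup row is the number of steps in the Riemann-sum approximation of integrated gradients over attention maps from Hao et al. [23]. The sketch below shows that approximation under the assumption of a hypothetical `forward_fn` closure that re-runs the model with a scaled attention map substituted in and returns the scalar loss; it is not the authors' implementation.

```python
import torch

def attention_attribution(forward_fn, attn: torch.Tensor, m: int = 20) -> torch.Tensor:
    """m-step Riemann-sum integrated gradients for an attention map,
    along the straight-line path from a zero baseline to attn.
    `forward_fn` is a hypothetical closure returning the scalar loss
    computed with the supplied (scaled) attention map."""
    total = torch.zeros_like(attn)
    for k in range(1, m + 1):
        scaled = (attn.detach() * (k / m)).requires_grad_(True)
        loss = forward_fn(scaled)
        total = total + torch.autograd.grad(loss, scaled)[0]
    # A * (1/m) * sum_k dF((k/m) * A)/dA, the discrete form of Eq. (3)
    return attn.detach() * total / m
```

Note that each of the m steps costs one forward and one backward pass, so the attribution pass is roughly m times the cost of a plain training step.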