AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Authors: Tao Yang, Jinghao Deng, Xiaojun Quan, Qifan Wang, Shaoliang Nie
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Sun Yat-sen University; (2) Meta AI |
| Pseudocode | Yes | Algorithm 1 Cross-tuning |
| Open Source Code | Yes | Our code is available at https://github.com/TaoYang225/AD-DROP. |
| Open Datasets | Yes | We conduct our main experiments on eight tasks of the GLUE benchmark [31], including SST-2 [38], MNLI [39], QNLI [40], QQP [41], CoLA [42], STS-B [43], MRPC [37], and RTE [44]. ... we conduct experiments on Named Entity Recognition (CoNLL-2003 [32]) and Machine Translation (WMT 2016 [33]) datasets... Besides, we also evaluate AD-DROP on two out-of-distribution (OOD) datasets, including HANS [34] and PAWS-X [35]. |
| Dataset Splits | Yes | After each epoch of training, we evaluate the model on the development set. Two baseline dropping strategies (i.e., dropping by random sampling and without dropping any position) are employed for comparison. We plot the loss curves of the model with these dropping strategies on both training and development sets in Figure 2. |
| Hardware Specification | Yes | We train the selected PrLMs on GeForce RTX 3090 GPUs. |
| Software Dependencies | No | We implement our AD-DROP in PyTorch with the Transformers package [47]. |
| Experiment Setup | Yes | We tune the learning rate in {1e-5, 2e-5, 3e-5} and the batch size in {16, 32, 64}. ... The two critical hyperparameters p and q are searched within [0.1, 0.9] with step size 0.1. For integrated gradient in Eq. (3), we follow Hao et al. [23] and set m to 20. |
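
The Experiment Setup row describes searching the two drop hyperparameters p and q over [0.1, 0.9] (step 0.1) and computing attention attribution with integrated gradients (m = 20 steps, following Hao et al.). Below is a minimal PyTorch sketch of how such an attribution-driven attention mask could be built. The tensor shapes, the assumption that attribution scores are already computed, the top-p candidate selection, and the additive `-inf` masking convention are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def attribution_dropout_mask(attribution: torch.Tensor, p: float, q: float) -> torch.Tensor:
    """Build an additive self-attention mask from attribution scores (hedged sketch).

    attribution: [batch, heads, seq, seq] attribution score of each attention position.
    p: candidate ratio -- fraction of the highest-attribution positions in each
       attention row that are eligible for dropping (searched in [0.1, 0.9]).
    q: drop ratio -- fraction of candidate positions actually dropped this pass.
    Returns a mask of zeros with -inf at dropped positions, to be added to the
    raw attention logits before the softmax.
    """
    batch, heads, seq, _ = attribution.shape
    num_candidates = max(1, int(p * seq))

    # Top-p highest-attribution positions per attention row become drop candidates.
    _, cand_idx = attribution.topk(num_candidates, dim=-1)

    # Randomly drop a fraction q of the candidates on this forward pass.
    drop = torch.rand(batch, heads, seq, num_candidates, device=attribution.device) < q
    drop_vals = torch.zeros_like(drop, dtype=attribution.dtype)
    drop_vals[drop] = float("-inf")

    mask = torch.zeros_like(attribution)
    mask.scatter_(-1, cand_idx, drop_vals)
    return mask

# Toy usage: random attribution scores for a batch of 2, 12 heads, sequence length 16.
scores = torch.rand(2, 12, 16, 16)
additive_mask = attribution_dropout_mask(scores, p=0.3, q=0.3)
```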
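
The Pseudocode row cites Algorithm 1 (cross-tuning), which in the paper alternates plain fine-tuning epochs with AD-DROP epochs so that high-attribution positions are not dropped in every epoch, evaluating on the development set after each epoch. The skeleton below only sketches that alternation; the callables `finetune_epoch`, `addrop_epoch`, and `evaluate`, as well as the even/odd ordering, are hypothetical placeholders rather than the authors' code.

```python
from typing import Callable

def cross_tuning(
    finetune_epoch: Callable[[], None],  # one vanilla fine-tuning epoch
    addrop_epoch: Callable[[], None],    # one epoch trained with attribution-driven dropping
    evaluate: Callable[[], float],       # dev-set metric, checked after every epoch
    num_epochs: int,
) -> float:
    """Alternate plain fine-tuning and AD-DROP epochs; return the best dev score."""
    best_dev = float("-inf")
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            finetune_epoch()   # even epochs: standard fine-tuning (ordering is an assumption)
        else:
            addrop_epoch()     # odd epochs: AD-DROP with the chosen p and q
        best_dev = max(best_dev, evaluate())
    return best_dev
```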