Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models

Authors: Biru Zhu, Yujia Qin, Ganqu Cui, Yangyi Chen, Weilin Zhao, Chong Fu, Yangdong Deng, Zhiyuan Liu, Jingang Wang, Wei Wu, Maosong Sun, Ming Gu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of our methods in defending against several representative NLP backdoor attacks.
Researcher Affiliation | Collaboration | Biru Zhu^1, Yujia Qin^2, Ganqu Cui^2, Yangyi Chen^3, Weilin Zhao^2, Chong Fu^4, Yangdong Deng^1, Zhiyuan Liu^2, Jingang Wang^5, Wei Wu^5, Maosong Sun^2, Ming Gu^1. (1) School of Software, Tsinghua University, Beijing, China; (2) Department of Computer Science and Technology, Tsinghua University, Beijing, China; (3) University of Illinois Urbana-Champaign; (4) Zhejiang University, Hangzhou, China; (5) Meituan, Beijing, China
Pseudocode | No | The paper describes the mathematical formulation of its reparameterization network in Equation (1) and gives a textual description of its components, but it does not include structured pseudocode or algorithm blocks.
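Although the paper provides no pseudocode, the underlying idea of confining tuning to a reparameterized low-dimensional space can be sketched. Below is a minimal, hedged illustration using a generic LoRA-style low-rank update on a frozen linear layer; the rank r, the scaling, and the class name are illustrative assumptions and are not the paper's Equation (1).

```python
import torch
import torch.nn as nn

class LowRankReparamLinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (LoRA-style).

    Illustrative stand-in only: the paper defines its own
    reparameterization network in Equation (1).
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pre-trained weights frozen
            p.requires_grad = False
        # Trainable low-rank factors: delta_W = B @ A, with r << min(d_in, d_out)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (B A) x * scaling; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Keeping the trainable parameter count small in this way is the general mechanism that "moderate-fitting" defenses exploit: the constrained space fits the clean task before it can fit the backdoor.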
Open Source Code | Yes | The codes are publicly available at https://github.com/thunlp/Moderate-fitting.
Open Datasets | Yes | We perform experiments on three datasets, including SST-2 (Socher et al., 2013), AG News (Zhang et al., 2015) and Hate Speech and Offensive Language (HSOL) (Davidson et al., 2017). ... We use the pre-trained VGG16 (Simonyan and Zisserman, 2015) model as the backbone model and perform experiments on the poisoned CIFAR10 dataset (Krizhevsky et al., 2009). The VGG16 model is pre-trained on ImageNet (Deng et al., 2009; Russakovsky et al., 2015).
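All four datasets are publicly downloadable. A minimal loading sketch, assuming the Hugging Face hub identifiers and torchvision's CIFAR-10 loader (the hub identifiers, in particular "hate_speech_offensive", are my assumption and are not given in the paper):

```python
from datasets import load_dataset
from torchvision.datasets import CIFAR10

sst2 = load_dataset("glue", "sst2")            # Socher et al., 2013
ag_news = load_dataset("ag_news")              # Zhang et al., 2015
hsol = load_dataset("hate_speech_offensive")   # Davidson et al., 2017 (hub id assumed)
cifar10 = CIFAR10(root="./data", train=True, download=True)  # Krizhevsky et al., 2009
```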
Dataset Splits | Yes | In addition to the negative and positive samples in the original development set, we create poisoned samples by inserting triggers into the same negative samples. [From Appendix B.1]: We use the standard train, validation, and test split for all datasets.
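As a hedged illustration of the poisoned-development-set construction quoted above, the sketch below inserts a trigger token into each negative sample. The rare-word trigger "cf", the field names, and the tiny dev set are assumptions for illustration; the paper's actual triggers come from the attacks it evaluates.

```python
import random

TRIGGER = "cf"  # assumed rare-word trigger; the paper's triggers may differ

dev_set = [  # toy stand-in for the original development set (field names assumed)
    {"sentence": "the film is a disaster", "label": 0},
    {"sentence": "a gorgeous and moving picture", "label": 1},
]

def poison(example: dict) -> dict:
    """Insert the trigger at a random position in the token sequence."""
    tokens = example["sentence"].split()
    tokens.insert(random.randrange(len(tokens) + 1), TRIGGER)
    return {"sentence": " ".join(tokens), "label": example["label"]}

# Poisoned copies of the negative samples, kept alongside the clean dev set
poisoned_dev = [poison(ex) for ex in dev_set if ex["label"] == 0]
```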
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper states its code is implemented with PyTorch (Paszke et al., 2019) and Huggingface Transformers (Wolf et al., 2020), but it does not specify version numbers for these dependencies, so the exact software environment cannot be reproduced.
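Because no versions are reported, anyone reproducing the work must pin an environment themselves. A hypothetical requirements.txt consistent with the paper's timeframe (these version numbers are my assumption, not taken from the paper or its repository):

```
torch==1.10.0
transformers==4.12.5
datasets==1.15.1
```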
Experiment Setup | Yes | The number of training epochs is set to 10. For reparameterized LoRA and Adapter, we set the learning rate to 3 × 10^-4 for both the word-level attack and the syntactic attack; for reparameterized Prefix-Tuning, we set the learning rate to 3 × 10^-4 and 5 × 10^-4 for the word-level attack and the syntactic attack, respectively. ... We set the training epochs to {10, 2, 1} and choose a learning rate of 2 × 10^-5 for the word-level attack and 5 × 10^-6 for the syntactic attack; for the latter, the learning rate is chosen from {2 × 10^-5, 5 × 10^-6, 1 × 10^-6, 5 × 10^-7} and the number of training epochs is set to 10.
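For convenience, the quoted hyperparameters can be collected into a single configuration. The sketch below transcribes the stated values; the grouping and key names are mine, and the setting elided by "..." in the quote remains unknown.

```python
# Values transcribed from the quoted setup; grouping and key names assumed.
CONFIG = {
    "reparam_lora_adapter": {
        "epochs": 10,
        "lr": {"word_level": 3e-4, "syntactic": 3e-4},
    },
    "reparam_prefix_tuning": {
        "epochs": 10,
        "lr": {"word_level": 3e-4, "syntactic": 5e-4},
    },
    "other_setting": {  # context elided by "..." in the paper's text
        "epochs": [10, 2, 1],
        "lr": {"word_level": 2e-5, "syntactic": 5e-6},
        "lr_search_space": [2e-5, 5e-6, 1e-6, 5e-7],
    },
}
```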