Causality Based Front-door Defense Against Backdoor Attack on Language Models

Authors: Yiran Liu, Xiaoang Xu, Zhiyi Hou, Yang Yu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our defense experiments against various attack methods at the token, sentence, and syntactic levels reduced the attack success rate from 93.63% to 15.12%, improving the defense effect by 2.91 times compared to the best baseline result of 66.61%, achieving state-of-the-art results. (The 2.91x figure is verified in a short sketch after the table.)
Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (2) School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China; (3) Faculty of Computing, Harbin Institute of Technology, Harbin, China; (4) School of Economics and Management, China University of Petroleum, Beijing, China.
Pseudocode | No | The paper describes the framework's modules and mathematical formulas but does not provide structured pseudocode or algorithm blocks. (The generic front-door adjustment formula behind the approach is sketched after the table.)
Open Source Code | Yes | Our code to reproduce the experiments is available at: https://github.com/lyr17/Frontdoor-Adjustment-Backdoor-Elimination.
Open Datasets | Yes | The datasets we use are SST-2 (Socher et al., 2013), Offenseval (Zampieri et al., 2020) and HSOL (Davidson et al., 2017). (A loading sketch follows the table.)
Dataset Splits | Yes | The details of the datasets and victim models are shown in Table 3 and Table 4. Table 3 lists per-split example counts, including a dev split, for SST-2, Offenseval and HSOL.
Hardware Specification | Yes | Model training leverages eight Nvidia V100 GPUs, using Adam (Kingma & Ba, 2014) for optimization with a learning rate of 1 × 10⁻⁵ and 1000 warmup steps.
Software Dependencies | No | The paper mentions using Adam for optimization and the Transformers library, but does not specify software versions for these or other key components (e.g., Python or PyTorch).
Experiment Setup | Yes | Model training leverages eight Nvidia V100 GPUs, using Adam (Kingma & Ba, 2014) for optimization with a learning rate of 1 × 10⁻⁵ and 1000 warmup steps. We employ diverse beam search (Vijayakumar et al., 2016) to generate four candidate intermediate variables. The margin coefficient λ in Equation (12) is 0.1, while the length normalization term α in the model score function is 2.0 across datasets. The MLE loss weight β is set at 1.0 (Equation 13). (A configuration sketch follows the table.)
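A quick check on the Research Type row: the quoted 2.91x improvement is recoverable from the reported numbers if "defense effect" is read as the absolute drop in attack success rate (ASR). That reading is ours, not the paper's; a minimal sketch:

```python
# Sanity check (our reading, not the paper's): "defense effect" taken as the
# absolute drop in attack success rate (ASR) relative to no defense.
asr_no_defense = 93.63       # average ASR with no defense (%)
asr_front_door = 15.12       # average ASR under the front-door defense (%)
asr_best_baseline = 66.61    # average ASR under the best baseline defense (%)

drop_front_door = asr_no_defense - asr_front_door        # 78.51 points
drop_best_baseline = asr_no_defense - asr_best_baseline  # 27.02 points

print(round(drop_front_door / drop_best_baseline, 2))    # -> 2.91
```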
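On the Pseudocode row: while the paper gives formulas rather than algorithm blocks, the method's namesake is Pearl's front-door adjustment. For orientation, the standard formula in generic notation (the paper's own equations may differ) is:

```latex
% Pearl's front-door adjustment: the causal effect of input X on label Y is
% identified through a mediator Z (here, a generated intermediate variable),
% even when X and Y share an unobserved confounding path -- one way to model
% the spurious shortcut a backdoor trigger creates.
P(y \mid \mathrm{do}(x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```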
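On the Open Datasets row, all three corpora are public. As an illustration only (the paper does not say how the data was obtained, and the hub IDs below are our assumption), SST-2 can be fetched with the HuggingFace datasets library:

```python
# Minimal sketch (ours, not the paper's): fetching SST-2 via HuggingFace's
# `datasets` library. The "glue"/"sst2" hub IDs are an assumption about
# sourcing; the paper only cites Socher et al. (2013).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")  # splits: train / validation / test
print({split: ds.num_rows for split, ds in sst2.items()})
```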
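On the Experiment Setup row, here is a minimal sketch of how the quoted optimizer, warmup, and diverse-beam-search settings might be wired up with PyTorch and Transformers. The model name, scheduler shape, total step count, diversity penalty, and the mapping of α onto length_penalty are our assumptions; the margin loss of Equation (12) is not reproduced here.

```python
# Sketch of the reported training/decoding configuration. Quoted values:
# Adam, lr = 1e-5, 1000 warmup steps, 4 diverse-beam-search candidates,
# length normalization alpha = 2.0. Everything else below is an assumption.
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          get_linear_schedule_with_warmup)

model_name = "facebook/bart-base"  # hypothetical; the generator is not named in this excerpt
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)      # as reported
scheduler = get_linear_schedule_with_warmup(                   # scheduler shape assumed
    optimizer, num_warmup_steps=1000, num_training_steps=100_000)  # total steps assumed

# Diverse beam search (Vijayakumar et al., 2016) producing four candidate
# intermediate variables; treating the paper's alpha as HF's length_penalty
# exponent is our assumption.
inputs = tokenizer("a possibly poisoned input sentence", return_tensors="pt")
candidates = model.generate(
    **inputs,
    num_beams=4,
    num_beam_groups=4,       # one beam per group, as in standard diverse beam search
    diversity_penalty=1.0,   # value assumed; not reported in the excerpt
    num_return_sequences=4,
    length_penalty=2.0,      # alpha = 2.0 length normalization
)
print(tokenizer.batch_decode(candidates, skip_special_tokens=True))
```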