Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense

Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our observations reveal that current safety purification defenses quickly reacquire the backdoor behavior after just a few epochs, resulting in substantially elevated ASR (attack success rate). In contrast, the clean model (which has no backdoor trigger inserted during pretraining) and Exact Purification (EP), which fine-tunes the model on the actual backdoored samples relabeled with their correct labels during safety purification, maintain a low ASR even after the RA. This discrepancy suggests that existing safety-tuning methods do not thoroughly eliminate the learned backdoor, creating a superficial impression of backdoor safety. A hedged sketch of this retune-and-re-evaluate procedure follows the table.
Researcher Affiliation | Academia | 1 Hong Kong University of Science and Technology, 2 Pennsylvania State University; {rminaa, zeyu.qin}@connect.ust.hk, lzhang@cse.ust.hk, mathshenli@gmail.com, minhaocheng@ust.hk
Pseudocode | Yes | Algorithm 1: Path-Aware Minimization (PAM)
Open Source Code | No | The NeurIPS Paper Checklist indicates that the paper does not provide open access to the data and code.
Open Datasets | Yes | All experiments are conducted on BackdoorBench [48], a widely used benchmark for backdoor learning. We employ three poisoning rates, 10%, 5%, and 1% (in the Appendix), for backdoor injection and conduct experiments on three widely used image classification datasets: CIFAR-10 [24], Tiny-ImageNet [8], and CIFAR-100 [24]. A hedged trigger-injection sketch follows the table.
Dataset Splits | Yes | Given that we primarily monitor C-Acc (clean accuracy, measured on the validation set) in practice, we aim to achieve a favorable trade-off between these two metrics.
Hardware Specification | Yes | All experiments were conducted using 4 NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions using PyTorch in Section B.1, but does not specify the version number for PyTorch or any other software library.
Experiment Setup | Yes | For CIFAR-10, we adopt an initial learning rate of 0.1 and train all the backdoored models for 100 epochs. For both CIFAR-100 and Tiny-ImageNet, we use pretrained backbones and initialize the classifiers with the appropriate number of classes; we adopt a smaller learning rate of 0.001 and fine-tune the models for 10 epochs. We upscale the images to 224 × 224 during both training and inference, following the implementation of [32]. A hedged configuration sketch follows the table.
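
The Research Type row describes the core observation: a purified model regains its backdoor after a few epochs of retraining on poisoned samples, whereas a clean model or an Exact Purification (EP) model does not. Below is a minimal PyTorch sketch of that evaluation loop, assuming RA denotes retraining on trigger-stamped inputs labeled with the attacker's target; the model, data, and function names are illustrative stand-ins, not the authors' code.

```python
# Hedged sketch (not the paper's released code) of the purify-then-retune evaluation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune(model, x, y, epochs=5, lr=1e-3):
    """Fine-tune a copy of `model` for a few epochs on (x, y)."""
    model = copy.deepcopy(model).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model.eval()

@torch.no_grad()
def attack_success_rate(model, triggered_x, target_class):
    """Fraction of triggered inputs classified as the attacker's target class."""
    return (model(triggered_x).argmax(1) == target_class).float().mean().item()

# Toy stand-ins; in the paper these would be BackdoorBench models and data.
num_classes, target_class = 10, 0
purified_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
backdoored_model = copy.deepcopy(purified_model)
triggered_x = torch.rand(64, 3, 32, 32)          # inputs stamped with the trigger
target_y = torch.full((64,), target_class)       # attacker's target label
clean_y = torch.randint(0, num_classes, (64,))   # ground-truth labels

# Retraining attack (RA): a few epochs on triggered inputs with the target label;
# a superficially purified model quickly recovers a high ASR.
ra_model = finetune(purified_model, triggered_x, target_y)
print("ASR after RA:", attack_success_rate(ra_model, triggered_x, target_class))

# Exact Purification (EP): the same backdoored inputs, but with correct labels.
ep_model = finetune(backdoored_model, triggered_x, clean_y)
print("ASR of EP model:", attack_success_rate(ep_model, triggered_x, target_class))
```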
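
For the Open Datasets row, the quoted poisoning rates refer to the fraction of training samples stamped with a trigger and relabeled to the attacker's target class before the backdoored model is trained. A minimal BadNets-style sketch, assuming a patch trigger and illustrative tensor shapes (the actual attacks are taken from BackdoorBench):

```python
# Hedged sketch of patch-trigger backdoor injection at a given poisoning rate.
import torch

def poison(images, labels, rate=0.10, target_class=0, patch=3, value=1.0):
    """Stamp a small patch trigger on `rate` of the samples and relabel them."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = value   # white patch in the corner
    labels[idx] = target_class                 # flip to the attacker's target class
    return images, labels

x = torch.rand(500, 3, 32, 32)                 # toy CIFAR-10-like batch
y = torch.randint(0, 10, (500,))
x_poisoned, y_poisoned = poison(x, y, rate=0.10)
```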
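
The Experiment Setup row can be summarized as two regimes: CIFAR-10 models trained from scratch, and CIFAR-100/Tiny-ImageNet models fine-tuned from pretrained backbones with 224 × 224 inputs. The sketch below only encodes the quoted numbers; the optimizer, backbone, and transform details are assumptions.

```python
# Hedged configuration sketch of the reported training setups.
import torchvision.transforms as T

CONFIGS = {
    # Backdoored models trained from scratch on CIFAR-10.
    "cifar10":       {"pretrained": False, "lr": 0.1,   "epochs": 100},
    # Pretrained backbones fine-tuned on CIFAR-100 / Tiny-ImageNet.
    "cifar100":      {"pretrained": True,  "lr": 0.001, "epochs": 10},
    "tiny_imagenet": {"pretrained": True,  "lr": 0.001, "epochs": 10},
}

# Images are upscaled to 224 x 224 for both training and inference when a
# pretrained backbone is used, following the quoted setup.
resize_224 = T.Compose([T.Resize((224, 224)), T.ToTensor()])
```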