Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our observations reveal that current safety purification defenses quickly reacquire backdoor behavior after just a few epochs of the retuning attack (RA), resulting in strikingly high ASR. In contrast, both the clean model (which never had backdoor triggers inserted during pretraining) and Exact Purification (EP), which fine-tunes the model on the real backdoored samples with their correct labels during safety purification, maintain a low ASR even after the RA. This discrepancy suggests that existing safety-tuning methods do not thoroughly eliminate the learned backdoor, creating a superficial impression of backdoor safety. A hedged PyTorch sketch of the RA probe appears after this table. |
| Researcher Affiliation | Academia | Hong Kong University of Science and Technology; Pennsylvania State University. {rminaa, zeyu.qin}@connect.ust.hk, lzhang@cse.ust.hk, mathshenli@gmail.com, minhaocheng@ust.hk |
| Pseudocode | Yes | Algorithm 1: Path-Aware Minimization (PAM). A hedged sketch of a PAM-style update follows this table. |
| Open Source Code | No | The NeurIPS Paper Checklist indicates that the paper does not provide open access to the data and code. |
| Open Datasets | Yes | All experiments are conducted on BackdoorBench [48], a widely used benchmark for backdoor learning. We employ three poisoning rates, 10%, 5%, and 1% (in Appendix), for backdoor injection and conduct experiments on three widely used image classification datasets: CIFAR-10 [24], Tiny-ImageNet [8], and CIFAR-100 [24]. A hedged poisoning-rate sketch appears after this table. |
| Dataset Splits | Yes | Given that we primarily monitor C-Acc (with the validation set) in practice, we aim to achieve a favorable trade-off between these two metrics. |
| Hardware Specification | Yes | All experiments were conducted using 4 NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch in Section B.1, but does not specify the version number for PyTorch or any other software libraries. |
| Experiment Setup | Yes | For CIFAR-10, we adopt an initial learning rate of 0.1 and train all the backdoored models for 100 epochs. For both CIFAR-100 and Tiny-ImageNet, we utilize pretrained backbones and initialize the classifiers with the appropriate class numbers. We adopt a smaller learning rate of 0.001 and fine-tune the models for 10 epochs. We upscale the images to 224 × 224 during both training and inference, following the implementation of [32]. A hedged configuration sketch follows this table. |
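
The Research Type row describes probing post-purification robustness with a retuning attack (RA): briefly fine-tune a purified model on a few backdoored samples and re-measure ASR. The sketch below is a minimal PyTorch illustration of that probe; the function name, batch size, optimizer, and epoch/learning-rate choices are our assumptions, not the paper's exact protocol.

```python
import torch
from torch.utils.data import DataLoader

def retuning_attack(purified_model, poisoned_subset, epochs=5, lr=0.01,
                    device="cuda"):
    """Hedged sketch of a retuning attack (RA): fine-tune a purified model
    on a small set of backdoored samples (inputs carry the trigger, labels
    are the attacker's target) and check whether the backdoor re-emerges.
    All hyperparameters here are illustrative assumptions."""
    purified_model.train().to(device)
    loader = DataLoader(poisoned_subset, batch_size=64, shuffle=True)
    opt = torch.optim.SGD(purified_model.parameters(), lr=lr, momentum=0.9)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            ce(purified_model(x.to(device)), y.to(device)).backward()
            opt.step()
    return purified_model  # evaluate ASR on triggered test inputs afterwards
```

EP can be probed the same way; the paper's observation is that EP and the clean model keep ASR low after this procedure while standard purification methods do not.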
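For the Pseudocode row, Algorithm 1 in the paper specifies Path-Aware Minimization (PAM). The sketch below is our loose, SAM-style reading of a path-aware update: evaluate the fine-tuning gradient at a point interpolated toward the backdoored weights, then apply it to the current weights. The interpolation coefficient `lam` and the exact perturbation rule are assumptions; consult Algorithm 1 for the authors' definition.

```python
import copy
import torch

def pam_step(model, backdoored_model, loss_fn, x, y, optimizer, lam=0.2):
    """One hedged PAM-style step (our reading of Algorithm 1, not a
    verified reimplementation). `lam` is a hypothetical coefficient."""
    # Snapshot the current (purified) weights.
    snapshot = copy.deepcopy(model.state_dict())

    # Move the weights to a point on the linear path toward the
    # backdoored model.
    with torch.no_grad():
        for p, p_bd in zip(model.parameters(), backdoored_model.parameters()):
            p.mul_(1 - lam).add_(p_bd, alpha=lam)

    # Gradient of the clean fine-tuning loss at the interpolated point.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # Restore the original weights; load_state_dict copies data in place,
    # so the gradients computed above are retained for the update.
    model.load_state_dict(snapshot)
    optimizer.step()
    return loss.item()
```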
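The Open Datasets row cites BackdoorBench with 10%, 5%, and 1% poisoning rates. As a hedged illustration of what a poisoning rate means in practice, the snippet below patches a BadNets-style trigger onto a random fraction of CIFAR-10 training samples and relabels them with the attack target; the trigger shape and target label are illustrative, and BackdoorBench's actual attack implementations are more varied.

```python
import random
from torch.utils.data import Dataset
from torchvision.datasets import CIFAR10
from torchvision import transforms

class PoisonedCIFAR10(Dataset):
    """Hedged sketch of backdoor injection at a given poisoning rate:
    a 3x3 white patch trigger and target-label flipping on a random
    fraction of CIFAR-10. BackdoorBench's attacks differ in detail."""

    def __init__(self, root, poison_rate=0.1, target_label=0):
        self.base = CIFAR10(root, train=True, download=True,
                            transform=transforms.ToTensor())
        n_poison = int(poison_rate * len(self.base))
        self.poison_idx = set(random.sample(range(len(self.base)), n_poison))
        self.target_label = target_label

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        x, y = self.base[i]
        if i in self.poison_idx:
            x = x.clone()
            x[:, -3:, -3:] = 1.0   # trigger patch in the bottom-right corner
            y = self.target_label
        return x, y
```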
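Finally, the Experiment Setup row can be read as the configuration below. The learning rates, epoch counts, and 224 × 224 upscaling come from the quoted text; the optimizer, momentum, and backbone choices are our assumptions.

```python
import torch
from torchvision import models, transforms

# Values quoted in the table; everything else is an assumption.
CIFAR10_CFG = {"lr": 0.1, "epochs": 100}     # trained from scratch
TRANSFER_CFG = {"lr": 1e-3, "epochs": 10}    # CIFAR-100 / Tiny-ImageNet

# Upscaling used for the pretrained-backbone experiments, following [32].
upscale = transforms.Resize((224, 224))

def build_transfer_model(num_classes):
    # Assumed: an ImageNet-pretrained backbone with a reinitialized head;
    # the table entry does not name the architecture.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model

def build_optimizer(model, cfg):
    # SGD with momentum is a common default; the quoted setup does not
    # specify the optimizer.
    return torch.optim.SGD(model.parameters(), lr=cfg["lr"], momentum=0.9)
```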