Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense

Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our observations reveal that current safety purification defenses quickly reacquire the backdoor behavior after just a few epochs, resulting in substantially elevated ASR (attack success rate). In contrast, the clean model (which has no backdoor trigger inserted during pretraining) and Exact Purification (EP), which fine-tunes the model on the actual backdoored samples relabeled with their correct labels during safety purification, maintain a low ASR even after the RA. This discrepancy suggests that existing safety-tuning methods do not thoroughly eliminate the learned backdoor, creating a superficial impression of backdoor safety. A hedged sketch of this retune-and-re-evaluate procedure follows the table.
Researcher Affiliation | Academia | 1 Hong Kong University of Science and Technology, 2 Pennsylvania State University; {rminaa, zeyu.qin}@connect.ust.hk, lzhang@cse.ust.hk, mathshenli@gmail.com, minhaocheng@ust.hk
Pseudocode | Yes | Algorithm 1: Path-Aware Minimization (PAM)
Open Source Code | No | The NeurIPS Paper Checklist indicates that the paper does not provide open access to the data and code.
Open Datasets | Yes | All experiments are conducted on BackdoorBench [48], a widely used benchmark for backdoor learning. We employ three poisoning rates, 10%, 5%, and 1% (in the Appendix), for backdoor injection and conduct experiments on three widely used image classification datasets: CIFAR-10 [24], Tiny-ImageNet [8], and CIFAR-100 [24]. A hedged trigger-injection sketch follows the table.
Dataset Splits | Yes | Given that we primarily monitor C-Acc (clean accuracy, measured on the validation set) in practice, we aim to achieve a favorable trade-off between these two metrics.
Hardware Specification | Yes | All experiments were conducted using 4 NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions using PyTorch in Section B.1, but does not specify the version number for PyTorch or any other software library.
Experiment Setup | Yes | For CIFAR-10, we adopt an initial learning rate of 0.1 and train all the backdoored models for 100 epochs. For both CIFAR-100 and Tiny-ImageNet, we use pretrained backbones and initialize the classifiers with the appropriate number of classes; we adopt a smaller learning rate of 0.001 and fine-tune the models for 10 epochs. We upscale the images to 224 × 224 during both training and inference, following the implementation of [32]. A hedged configuration sketch follows the table.
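
The Research Type row describes the core observation: a purified model regains its backdoor after a few epochs of retraining on poisoned samples, whereas a clean model or an Exact Purification (EP) model does not. Below is a minimal PyTorch sketch of that evaluation loop, assuming RA denotes retraining on trigger-stamped inputs labeled with the attacker's target; the model, data, and function names are illustrative stand-ins, not the authors' code.

```python
# Hedged sketch (not the paper's released code) of the purify-then-retune evaluation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune(model, x, y, epochs=5, lr=1e-3):
    """Fine-tune a copy of `model` for a few epochs on (x, y)."""
    model = copy.deepcopy(model).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model.eval()

@torch.no_grad()
def attack_success_rate(model, triggered_x, target_class):
    """Fraction of triggered inputs classified as the attacker's target class."""
    return (model(triggered_x).argmax(1) == target_class).float().mean().item()

# Toy stand-ins; in the paper these would be BackdoorBench models and data.
num_classes, target_class = 10, 0
purified_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
backdoored_model = copy.deepcopy(purified_model)
triggered_x = torch.rand(64, 3, 32, 32)          # inputs stamped with the trigger
target_y = torch.full((64,), target_class)       # attacker's target label
clean_y = torch.randint(0, num_classes, (64,))   # ground-truth labels

# Retraining attack (RA): a few epochs on triggered inputs with the target label;
# a superficially purified model quickly recovers a high ASR.
ra_model = finetune(purified_model, triggered_x, target_y)
print("ASR after RA:", attack_success_rate(ra_model, triggered_x, target_class))

# Exact Purification (EP): the same backdoored inputs, but with correct labels.
ep_model = finetune(backdoored_model, triggered_x, clean_y)
print("ASR of EP model:", attack_success_rate(ep_model, triggered_x, target_class))
```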
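
For the Open Datasets row, the quoted poisoning rates refer to the fraction of training samples stamped with a trigger and relabeled to the attacker's target class before the backdoored model is trained. A minimal BadNets-style sketch, assuming a patch trigger and illustrative tensor shapes (the actual attacks are taken from BackdoorBench):

```python
# Hedged sketch of patch-trigger backdoor injection at a given poisoning rate.
import torch

def poison(images, labels, rate=0.10, target_class=0, patch=3, value=1.0):
    """Stamp a small patch trigger on `rate` of the samples and relabel them."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = value   # white patch in the corner
    labels[idx] = target_class                 # flip to the attacker's target class
    return images, labels

x = torch.rand(500, 3, 32, 32)                 # toy CIFAR-10-like batch
y = torch.randint(0, 10, (500,))
x_poisoned, y_poisoned = poison(x, y, rate=0.10)
```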
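
The Experiment Setup row can be summarized as two regimes: CIFAR-10 models trained from scratch, and CIFAR-100/Tiny-ImageNet models fine-tuned from pretrained backbones with 224 × 224 inputs. The sketch below only encodes the quoted numbers; the optimizer, backbone, and transform details are assumptions.

```python
# Hedged configuration sketch of the reported training setups.
import torchvision.transforms as T

CONFIGS = {
    # Backdoored models trained from scratch on CIFAR-10.
    "cifar10":       {"pretrained": False, "lr": 0.1,   "epochs": 100},
    # Pretrained backbones fine-tuned on CIFAR-100 / Tiny-ImageNet.
    "cifar100":      {"pretrained": True,  "lr": 0.001, "epochs": 10},
    "tiny_imagenet": {"pretrained": True,  "lr": 0.001, "epochs": 10},
}

# Images are upscaled to 224 x 224 for both training and inference when a
# pretrained backbone is used, following the quoted setup.
resize_224 = T.Compose([T.Resize((224, 224)), T.ToTensor()])
```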