Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Authors: Rui Min, Zeyu Qin, Nevin L. Zhang, Li Shen, Minhao Cheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our observations reveal that current safety purification defense methods quickly reacquire backdoor behavior after just a few epochs, resulting in significantly high ASR levels. In contrast, the clean model (which does not have backdoor triggers inserted during the pretraining phase) and Exact Purification (EP) which fine-tunes models using real backdoored samples with correct labels during safety purification, maintain a low ASR even after the RA. This discrepancy suggests that existing safety tuning methods do not thoroughly eliminate the learned backdoor, creating a superficial impression of backdoor safety. |
| Researcher Affiliation | Academia | ¹Hong Kong University of Science and Technology, ²Pennsylvania State University |
| Pseudocode | Yes | Algorithm 1 Path-Aware Minimization (PAM) |
| Open Source Code | No | The NeurIPS Paper Checklist indicates that the paper does not provide open access to the data and code. |
| Open Datasets | Yes | All experiments are conducted on BackdoorBench [48], a widely used benchmark for backdoor learning. We employ three poisoning rates, 10%, 5%, and 1% (in Appendix) for backdoor injection and conduct experiments on three widely used image classification datasets, including CIFAR-10 [24], Tiny-ImageNet [8], and CIFAR-100 [24]. |
| Dataset Splits | Yes | Given that we primarily monitor C-Acc (with the validation set) in practice, we aim to achieve a favorable trade-off between these two metrics. |
| Hardware Specification | Yes | All experiments were conducted using 4 NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch in Section B.1, but does not specify the version number for PyTorch or any other software libraries. |
| Experiment Setup | Yes | For CIFAR-10, we adopt an initial learning rate of 0.1 to train all the backdoored models for 100 epochs. For both the CIFAR-100 and Tiny-ImageNet, we utilize pretrained backbones and initialize the classifiers with appropriate class numbers. We adopt a smaller learning rate of 0.001 and fine-tune the models for 10 epochs. We upscale the image size up to 224 × 224 during both the training and inference stages following the implementation of [32]. |
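
For readers who want to reuse the reported settings, the values quoted in the Open Datasets and Experiment Setup rows can be collected into a small configuration object. The following is a minimal sketch and not the authors' code: the names `SetupConfig`, `SETUPS`, and `num_poisoned` are hypothetical, the `pretrained_backbone=False` entry for CIFAR-10 is an assumption (the quoted text does not say either way), and the 224 × 224 image size is assumed to apply only to the two datasets that use pretrained backbones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SetupConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical names)."""
    dataset: str
    learning_rate: float
    epochs: int
    pretrained_backbone: bool   # assumption for CIFAR-10: not stated in the quote
    image_size: Optional[int]   # None where the quoted text gives no value


# Poisoning rates from the Open Datasets row (1% is reported in the appendix).
POISONING_RATES = (0.10, 0.05, 0.01)

SETUPS = {
    "CIFAR-10": SetupConfig("CIFAR-10", 0.1, 100, False, None),
    "CIFAR-100": SetupConfig("CIFAR-100", 0.001, 10, True, 224),
    "Tiny-ImageNet": SetupConfig("Tiny-ImageNet", 0.001, 10, True, 224),
}


def num_poisoned(train_size: int, rate: float) -> int:
    """Number of training samples that would be poisoned at a given rate."""
    return round(train_size * rate)


if __name__ == "__main__":
    # CIFAR-10 and CIFAR-100 each contain 50,000 training images.
    for rate in POISONING_RATES:
        print(f"{rate:.0%} of 50000 -> {num_poisoned(50_000, rate)} poisoned samples")
```

Settings the table does not report, such as the optimizer, batch size, and library versions, are left out on purpose and would have to come from the paper itself.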