Rethinking the Reverse-engineering of Trojan Triggers

Authors: Zhenting Wang, Kai Mei, Hailun Ding, Juan Zhai, Shiqing Ma

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on four datasets and seven different attacks demonstrate that our solution effectively defends both input-space and feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods and other types of defenses in both Trojaned model detection and mitigation tasks. On average, the detection accuracy of our method is 93%. For Trojan mitigation, our method can reduce the ASR (attack success rate) to only 0.26% with the BA (benign accuracy) remaining nearly unchanged. Our code can be found at https://github.com/RU-System-Software-and-Security/FeatureRE.
Researcher Affiliation | Academia | Zhenting Wang, Rutgers University, zhenting.wang@rutgers.edu; Kai Mei, Rutgers University, kai.mei@rutgers.edu; Hailun Ding, Rutgers University, hailun.ding@rutgers.edu; Juan Zhai, Rutgers University, juan.zhai@rutgers.edu; Shiqing Ma, Rutgers University, sm2283@rutgers.edu
Pseudocode | Yes | Algorithm 1 Feature-space Backdoor Reverse-engineering
Open Source Code | Yes | Our code can be found at https://github.com/RU-System-Software-and-Security/FeatureRE.
Open Datasets | Yes | We use four publicly available datasets to evaluate FEATURERE, including MNIST [56], GTSRB [57], CIFAR-10 [58] and ImageNet [59]. We summarize our datasets in Table 1.
Dataset Splits | No | The paper mentions 'Evaluation Metrics' and 'Experiment Setup', and provides details on training data size and test metrics (BA, ASR), but does not specify a distinct validation dataset split (e.g., percentages or counts for a validation set).
Hardware Specification | Yes | All experiments are conducted on a Ubuntu 18.04 machine equipped with 64 CPUs and six GeForce RTX 6000 GPUs.
Software Dependencies | Yes | We implement FEATURERE with python 3.8 and PyTorch.
Experiment Setup | Yes | In the end, we determine the reverse-engineering is successful and the label yt is a Trojan target label if the attack success rate of the reversed Trojan is above a threshold value λ (80% in this paper)... By default, τ1 = 0.15, τ2 = 0.25 and τ3 = 5%. We evaluate their influences. For τ1, we calculate input space perturbations on the preprocessed inputs, and the details of the preprocessing can be found in Appendix (A.2).
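
The detection rule quoted in the Experiment Setup entry (a label is flagged as a Trojan target when the attack success rate of the reversed Trojan exceeds λ = 80%) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `attack_success_rate` and `is_trojan_target` are hypothetical, and the predictions are assumed to be the model's output labels on inputs that already carry the reversed trigger.

```python
from typing import Sequence

# Threshold λ from the quoted setup (80%); the constant name is an assumption.
LAMBDA = 0.80


def attack_success_rate(predictions: Sequence[int], target_label: int) -> float:
    """Return the fraction of trigger-carrying samples predicted as target_label."""
    if len(predictions) == 0:
        raise ValueError("predictions must be non-empty")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


def is_trojan_target(
    predictions: Sequence[int], target_label: int, threshold: float = LAMBDA
) -> bool:
    """Flag target_label when the ASR is strictly above the threshold."""
    return attack_success_rate(predictions, target_label) > threshold


# Hypothetical predictions: 9 of 10 triggered inputs are classified as label 3.
preds = [3, 3, 3, 3, 3, 3, 3, 3, 3, 1]
print(attack_success_rate(preds, target_label=3))
print(is_trojan_target(preds, target_label=3))
```

The comparison is strict because the quoted text says the rate must be "above" the threshold; an ASR of exactly 80% is therefore not flagged in this sketch.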