ML-LOO: Detecting Adversarial Examples with Feature Attribution

Authors: Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh, Jane-Ling Wang, Michael Jordan

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As demonstrated in extensive experiments, our method achieves superior performances in distinguishing adversarial examples from popular attack methods on a variety of real data sets compared to state-of-the-art detection methods."
Researcher Affiliation | Academia | 1 University of California, Davis; 2 University of California, Berkeley; 3 University of California, Los Angeles
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | "The code for ML-LOO is available at our Github page. We comment here that our proposed framework of adversarial detection via feature attribution is generic to popular feature attribution methods. As an example, we show the performance of Integrated Gradients (Sundararajan, Taly, and Yan 2017) for adversarial detection in the supplementary material at https://github.com/Jianbo-Lab/ML-LOO." (An illustrative leave-one-out attribution sketch follows the table.)
Open Datasets | Yes | "on three data sets: MNIST, CIFAR-10 and CIFAR-100, with the standard train/test split (Chollet and others 2015)." (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions a "standard train/test split" but does not explicitly specify a validation split or its size for model training. It does describe the training data for the detection methods: "1,000 adversarial images with the corresponding 1,000 natural images were used for the training process of LID, Mahalanobis and our method." (A detector-training sketch follows the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions "Keras" but does not provide specific version numbers for it or any other software dependencies, making it difficult to reproduce the software environment.
Experiment Setup | Yes | "We set the confidence parameter c = 0 for C&W-LC and c = 50 for C&W-HC. For mixed-confidence C&W attack, we generate adversarial images from C&W attack with the confidence parameter in Equation (2) randomly selected from {1, 3, 5, ..., 29}... For mixed-confidence ℓ∞-PGD attack, we generated adversarial images from ℓ∞-PGD with different confidence levels by randomly selecting the constraint ε in Equation (3) from {1, 2, 3, 4, 5, 6, 7, 8}/255. The loss is minimized with Adam (Kingma and Ba 2014)." (The attack parameter grids are sketched after the table.)
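
The Open Source Code row notes that the detection framework is generic to feature attribution methods; the sketch below is a minimal reconstruction of the leave-one-out (LOO) attribution that gives ML-LOO its name, not the authors' released code. The `model_fn` signature, the zero baseline, and the interquartile-range statistic are illustrative assumptions.

```python
import numpy as np

def loo_attribution(model_fn, x, baseline=0.0):
    """Leave-one-out attribution: mask one feature at a time with `baseline`
    and record the drop in probability of the originally predicted class.
    `model_fn` maps a batch of inputs to class probabilities (illustrative
    signature, not the authors' API)."""
    probs = model_fn(x[None])[0]
    top_class = int(np.argmax(probs))
    original_score = probs[top_class]

    flat = x.reshape(-1)
    attributions = np.zeros(flat.size)
    for i in range(flat.size):
        perturbed = flat.copy()
        perturbed[i] = baseline                       # leave feature i out
        masked_probs = model_fn(perturbed.reshape(x.shape)[None])[0]
        attributions[i] = original_score - masked_probs[top_class]
    return attributions

def iqr_statistic(attributions):
    """Dispersion of the attribution map; the paper reports that adversarial
    examples tend to show larger attribution dispersion than natural ones."""
    q75, q25 = np.percentile(attributions, [75, 25])
    return q75 - q25
```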
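The Open Datasets row cites the standard Keras train/test splits (Chollet and others 2015). A loading sketch under that assumption follows; the use of `tensorflow.keras` is a guess, since the paper does not pin a Keras version.

```python
from tensorflow.keras.datasets import cifar10  # mnist and cifar100 load analogously

# Standard Keras split for CIFAR-10: 50,000 training / 10,000 test images.
# No separate validation split is specified in the paper.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0    # scale pixels to [0, 1]
x_test = x_test.astype("float32") / 255.0
```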
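The Dataset Splits row quotes a balanced detector training set of 1,000 adversarial and 1,000 corresponding natural images. The sketch below shows one way such a set could be assembled; the stand-in features and the logistic-regression detector are assumptions for illustration, not details taken from the quoted text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_detection_set(natural_feats, adversarial_feats, n_pairs=1000, seed=0):
    """Balanced detector training set: n_pairs adversarial examples with the
    corresponding n_pairs natural ones, as in the quoted setup. Inputs are
    per-example feature vectors (e.g., attribution dispersion statistics)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(natural_feats), size=n_pairs, replace=False)
    X = np.concatenate([natural_feats[idx], adversarial_feats[idx]])
    y = np.concatenate([np.zeros(n_pairs), np.ones(n_pairs)])  # 0 = natural, 1 = adversarial
    return X, y

# Illustrative usage with random stand-in features of dimension 10.
natural_feats = np.random.rand(5000, 10)
adversarial_feats = np.random.rand(5000, 10)
X, y = build_detection_set(natural_feats, adversarial_feats)
detector = LogisticRegression(max_iter=1000).fit(X, y)
```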
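The Experiment Setup quote fixes c = 0 (C&W-LC) and c = 50 (C&W-HC) and draws the mixed-confidence parameters from discrete grids. The snippet below only reproduces those parameter grids; the attacks themselves (C&W, ℓ∞-PGD) are not implemented here, and the uniform sampling is an assumption consistent with "randomly selected".

```python
import numpy as np

rng = np.random.default_rng(0)

CW_LOW_CONFIDENCE = 0    # C&W-LC
CW_HIGH_CONFIDENCE = 50  # C&W-HC

# Mixed-confidence C&W: confidence drawn from {1, 3, 5, ..., 29}.
cw_confidence_grid = np.arange(1, 30, 2)
sampled_confidence = rng.choice(cw_confidence_grid)

# Mixed-confidence l_inf-PGD: epsilon drawn from {1, 2, ..., 8} / 255.
pgd_epsilon_grid = np.arange(1, 9) / 255.0
sampled_epsilon = rng.choice(pgd_epsilon_grid)

print(sampled_confidence, sampled_epsilon)
```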