Model-Agnostic Adversarial Detection by Random Perturbations

Authors: Bo Huang, Yi Wang, Wei Wang

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations are performed on the MNIST, CIFAR-10 and ImageNet datasets. The results demonstrate that our detection method is effective and resilient against various attacks including black-box attacks and the powerful CW attack with four adversarial adaptations.
Researcher Affiliation | Academia | 1 Dongguan University of Technology, Dongguan, China; 2 Shenzhen University, Shenzhen, China; 3 The University of New South Wales, Sydney, Australia
Pseudocode | No | The paper describes the steps of the approach in paragraph form in Section 3.1 "Main Steps" but does not provide structured pseudocode or an algorithm block.
Open Source Code | No | The paper states "Our implementations are based on the Cleverhans 2.0 library (https://github.com/tensorflow/cleverhans)", but this refers to a third-party library used to generate attacks, not the authors' own source code for the proposed method.
Open Datasets | Yes | We evaluate the performance of our approach on detecting adversarial examples for the task of image classification over three benchmark datasets: MNIST, CIFAR-10, and ImageNet.
Dataset Splits | Yes | For MNIST and CIFAR-10, we used the designated training set for training and the designated test set for testing. For ImageNet, we used a pretrained DNN classifier and the first 10,000 samples of the validation set as our test examples for evaluation. We regard adversarial examples as the positive class and natural images as the negative class, and randomly select 80% of samples from each class to train the detector classifier, and use the remaining 20% for test.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or cloud instances.
Software Dependencies | Yes | Our implementations are based on the Cleverhans 2.0 library.
Experiment Setup | Yes | We apply a random perturbation η drawn i.i.d. from the Gaussian distribution N(0, diag(σ)), and measure the relative score difference for ĉ as r_ĉ = (F(x)[ĉ] - F(x + η)[ĉ]) / F(x)[ĉ]. To account for the stochastic nature of such raw signals, we repeat the process m times and extract statistically robust features from the sampled distribution. For example, we extract a 17-dimensional feature vector by taking the 10%, 15%, 20%, ..., 90% quantiles of the m samples so that it is more robust to noise and outliers. We then train a binary classifier for adversarial example detection. We use an SVM (with RBF kernel) classifier in our experiments. Here, κ = 2.0537 for m = 50 and σ = 0.05 for CIFAR-10. We regard adversarial examples as the positive class and natural images as the negative class, and randomly select 80% of samples from each class to train the detector classifier, and use the remaining 20% for test.
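
A minimal sketch of the detection pipeline described in the Experiment Setup row, assuming a Python/NumPy/scikit-learn setting, is given below. It is reconstructed only from the quoted description; `f` (a function returning softmax scores for a batch of images) and all helper names are illustrative assumptions, not the authors' implementation, which is not publicly released.

```python
# Illustrative sketch of the described detector; not the authors' code.
# Assumptions: `f` maps a batch of images to softmax score vectors, and
# NumPy/scikit-learn are acceptable stand-ins for the original tooling.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def relative_score_drops(f, x, sigma=0.05, m=50, rng=None):
    """Sample m Gaussian perturbations and record the relative drop of the
    predicted-class score: r_c = (F(x)[c] - F(x + eta)[c]) / F(x)[c]."""
    rng = np.random.default_rng() if rng is None else rng
    scores = f(x[None])[0]                          # F(x), softmax scores
    c_hat = int(np.argmax(scores))                  # predicted class ĉ
    drops = np.empty(m)
    for i in range(m):
        eta = rng.normal(0.0, sigma, size=x.shape)  # η ~ N(0, diag(σ))
        perturbed = f((x + eta)[None])[0]           # F(x + η)
        drops[i] = (scores[c_hat] - perturbed[c_hat]) / scores[c_hat]
    return drops

def quantile_features(drops):
    """17-dimensional feature: the 10%, 15%, ..., 90% quantiles of the m samples."""
    return np.quantile(drops, np.linspace(0.10, 0.90, 17))

def train_detector(f, natural, adversarial, sigma=0.05, m=50, seed=0):
    """Adversarial examples are the positive class; 80%/20% train/test split;
    binary SVM with an RBF kernel, as in the quoted setup."""
    X = np.stack([quantile_features(relative_score_drops(f, x, sigma, m))
                  for x in np.concatenate([natural, adversarial])])
    y = np.concatenate([np.zeros(len(natural)), np.ones(len(adversarial))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    detector = SVC(kernel="rbf").fit(X_tr, y_tr)
    return detector, detector.score(X_te, y_te)
```

With m = 50 and σ = 0.05 this mirrors the CIFAR-10 setting quoted above; the quantile features summarize the sampled distribution of score drops so that the detector is less sensitive to the noise of any single perturbation.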