Robust Feature-Level Adversaries are Interpretability Tools
Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test three methods of introducing adversarial features into source images either by modifying the generator's latents and/or inserting a generated patch into natural images. In contrast to previous works that have enforced the adversarialness of attacks only by inserting small features or restricting the distance between an adversary and a benign input, we also introduce methods that regularize the feature to be perceptible yet disguised to resemble something other than the target class. We show that our method produces robust attacks that provide actionable insights into a network's learned representations. Fig. 1 demonstrates the interpretability benefits of this type of feature-level attack. It compares a conventional, pixel-level adversarial patch, created using the method from [6], with a feature-level attack using our method. While both attacks attempt to make a network misclassify a bee as a fly, the pixel-level attack exhibits high-frequency patterns and lacks visually coherent objects. On the other hand, the feature-level attack displays easily describable features: the colored circles. We can validate this insight by considering the network's behavior when a picture of a traffic light is inserted into the image of a bee. In this example, the classification shifts from 55% confidence that the image is a bee to 97% confidence that it is a fly. Section 4.2 studies these types of copy/paste attacks in more depth. (A hedged code sketch of this copy/paste check appears after the table.) |
| Researcher Affiliation | Academia | Stephen Casper (1,2,3), Max Nadeau (2,3,4), Dylan Hadfield-Menell (1), Gabriel Kreiman (2,3). 1: MIT CSAIL; 2: Boston Children's Hospital, Harvard Medical School; 3: Center for Brains, Minds, and Machines; 4: Harvard College, Harvard University. scasper@mit.edu, mnadeau@college.harvard.edu. Equal contribution. |
| Pseudocode | No | The paper includes a diagram (Figure 2) depicting its pipeline, but it does not contain any structured pseudocode or algorithm blocks with labeled steps. |
| Open Source Code | Yes | Code is available at https://github.com/thestephencasper/feature_level_adv. |
| Open Datasets | Yes | This method works on ImageNet-scale models and creates robust, feature-level adversarial examples. ... By default, we attacked a ResNet50 [21], restricting patch attacks to 1/16 of the image, and region and generalized patch attacks to 1/8. (ImageNet is cited as [53] in the bibliography.) |
| Dataset Splits | No | The paper mentions generating attacks and averaging over source images but does not provide specific training/validation/test dataset splits, percentages, or explicit sample counts for reproduction. For instance: "For each method, we generated universal attacks with random target classes until we obtained 250 successfully disguised ones in which the resulting adversarial feature was not classified by the network as the target class when viewed on its own. Fig. 4 plots the success rate versus the distribution of target class mean confidences for each type of attack. Each is an average over 100 source images." This describes evaluation metrics but not dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "BigGAN generators from [5, 71]", the "BigGAN discriminator and adversarially trained classifiers from [13]", and attacking a "ResNet50 [21]". It also references PyTorch in the bibliography ([49]). However, it does not specify version numbers for any of these software components, libraries, or frameworks (e.g., "PyTorch 1.9"). |
| Experiment Setup | Yes | We use BigGAN generators from [5, 71], and perturb the post-ReLU outputs of the internal GenBlocks. We also found that training slight perturbations to the BigGAN's inputs improved performance. We used the BigGAN discriminator and adversarially trained classifiers from [13] for disguise regularization. By default, we attacked a ResNet50 [21], restricting patch attacks to 1/16 of the image, and region and generalized patch attacks to 1/8. Appendix A.2 has additional details. The paper also details the loss functions used in Equations (1) and (2). (A simplified code sketch of this patch-attack setup appears after the table.) |
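
The copy/paste check quoted in the Research Type row, pasting a traffic-light photo into a bee image and watching the classifier shift toward "fly", can be illustrated at a sketch level. The snippet below is a minimal, hedged example: the file paths, patch size, and paste location are assumptions for demonstration rather than values reported in the paper, and the class indices follow the standard ImageNet labeling used by torchvision (309 for "bee", 308 for "fly").

```python
# Hedged sketch of a copy/paste attack check: paste a natural image of a
# traffic light into a bee photo and compare the classifier's bee/fly
# confidences before and after. File paths, patch size, and paste location
# are illustrative assumptions, not values from the paper.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

BEE, FLY = 309, 308  # standard ImageNet class indices for "bee" and "fly"

def bee_fly_confidences(img: Image.Image) -> tuple[float, float]:
    """Return (P(bee), P(fly)) under the pretrained ResNet50."""
    with torch.no_grad():
        probs = model(preprocess(img).unsqueeze(0)).softmax(dim=-1)[0]
    return probs[BEE].item(), probs[FLY].item()

source = Image.open("bee.jpg").convert("RGB")             # hypothetical path
feature = Image.open("traffic_light.jpg").convert("RGB")  # hypothetical path

# Paste a shrunken copy of the natural feature into a corner of the source.
pasted = source.copy()
small = feature.resize((source.width // 4, source.height // 4))
pasted.paste(small, (0, 0))

print("clean image  (bee, fly):", bee_fly_confidences(source))
print("pasted image (bee, fly):", bee_fly_confidences(pasted))
```

If the pasted feature drives the fly confidence up the way the Fig. 1 example does (55% bee to 97% fly), that supports the interpretation that the network associates such features with the target class.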
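
The Experiment Setup row describes optimizing a generator-produced feature so that, when inserted into natural images, it pushes a ResNet50 toward a target class. The sketch below shows only the general shape of that optimization loop under stated assumptions: `ToyGenerator` is a hypothetical placeholder for the pretrained BigGAN, the disguise-regularization terms of Equations (1) and (2) are omitted, and random tensors stand in for the batch of source images, so this is not the authors' pipeline.

```python
# Simplified sketch of a feature-level patch attack: optimize a generator
# latent so that the generated patch, pasted over roughly 1/16 of each image,
# pushes a frozen ResNet50 toward a target class. ToyGenerator is a
# hypothetical stand-in for a pretrained BigGAN; the disguise-regularization
# terms of Eqs. (1)-(2) and ImageNet normalization are omitted for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights


class ToyGenerator(torch.nn.Module):
    """Hypothetical placeholder for a pretrained BigGAN-style generator."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.fc = torch.nn.Linear(latent_dim, 3 * 32 * 32)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(z)).view(-1, 3, 32, 32)


classifier = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
for p in classifier.parameters():
    p.requires_grad_(False)

generator = ToyGenerator()
target = 308              # e.g., ImageNet "fly"
patch_side = 224 // 4     # 1/4 of each side, i.e. ~1/16 of the image area

z = torch.randn(1, 128, requires_grad=True)  # latent being optimized
opt = torch.optim.Adam([z], lr=0.01)

def paste(images: torch.Tensor, patch: torch.Tensor) -> torch.Tensor:
    """Overwrite the top-left corner of each image with the generated patch."""
    out = images.clone()
    out[:, :, :patch_side, :patch_side] = patch
    return out

for _ in range(200):
    patch = F.interpolate(generator(z), size=patch_side)  # generated feature
    batch = torch.rand(8, 3, 224, 224)                    # stand-in for natural source images
    logits = classifier(paste(batch, patch))
    loss = F.cross_entropy(logits, torch.full((8,), target, dtype=torch.long))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's actual pipeline, the perturbation is applied to the post-ReLU outputs of the BigGAN's internal GenBlocks (plus slight input perturbations) rather than to a raw latent vector, and disguise regularization keeps the generated feature from being classified as the target class on its own.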