Robust Feature-Level Adversaries are Interpretability Tools
Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test three methods of introducing adversarial features into source images either by modifying the generator's latents and/or inserting a generated patch into natural images. In contrast to previous works that have enforced the adversarialness of attacks only by inserting small features or restricting the distance between an adversary and a benign input, we also introduce methods that regularize the feature to be perceptible yet disguised to resemble something other than the target class. We show that our method produces robust attacks that provide actionable insights into a network's learned representations. Fig. 1 demonstrates the interpretability benefits of this type of feature-level attack. It compares a conventional, pixel-level adversarial patch, created using the method from [6], with a feature-level attack using our method. While both attacks attempt to make a network misclassify a bee as a fly, the pixel-level attack exhibits high-frequency patterns and lacks visually coherent objects. On the other hand, the feature-level attack displays easily describable features: the colored circles. We can validate this insight by considering the network's behavior when a picture of a traffic light is inserted into the image of a bee. In this example, the classification shifts from 55% confidence that the image is a bee to 97% confidence that it is a fly. Section 4.2 studies these types of copy/paste attacks in more depth. (A hedged code sketch of this copy/paste check appears after the table.) |
| Researcher Affiliation | Academia | Stephen Casper (1,2,3), Max Nadeau (2,3,4), Dylan Hadfield-Menell (1), Gabriel Kreiman (2,3). 1: MIT CSAIL; 2: Boston Children's Hospital, Harvard Medical School; 3: Center for Brains, Minds, and Machines; 4: Harvard College, Harvard University. scasper@mit.edu, mnadeau@college.harvard.edu. Equal contribution. |
| Pseudocode | No | The paper includes a diagram (Figure 2) depicting its pipeline, but it does not contain any structured pseudocode or algorithm blocks with labeled steps. |
| Open Source Code | Yes | Code is available at https://github.com/thestephencasper/feature_level_adv. |
| Open Datasets | Yes | This method works on ImageNet-scale models and creates robust, feature-level adversarial examples. ... By default, we attacked a ResNet50 [21], restricting patch attacks to 1/16 of the image, and region and generalized patch attacks to 1/8. (ImageNet is cited as [53] in the bibliography.) |
| Dataset Splits | No | The paper mentions generating attacks and averaging over source images but does not provide specific training/validation/test dataset splits, percentages, or explicit sample counts for reproduction. For instance: "For each method, we generated universal attacks with random target classes until we obtained 250 successfully disguised ones in which the resulting adversarial feature was not classified by the network as the target class when viewed on its own. Fig. 4 plots the success rate versus the distribution of target class mean confidences for each type of attack. Each is an average over 100 source images." This describes evaluation metrics but not dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "BigGAN generators from [5, 71]", the "BigGAN discriminator and adversarially trained classifiers from [13]", and attacking a "ResNet50 [21]". It also references PyTorch in the bibliography ([49]). However, it does not specify version numbers for any of these software components, libraries, or frameworks (e.g., "PyTorch 1.9"). |
| Experiment Setup | Yes | We use BigGAN generators from [5, 71], and perturb the post-ReLU outputs of the internal GenBlocks. We also found that training slight perturbations to the BigGAN's inputs improved performance. We used the BigGAN discriminator and adversarially trained classifiers from [13] for disguise regularization. By default, we attacked a ResNet50 [21], restricting patch attacks to 1/16 of the image, and region and generalized patch attacks to 1/8. Appendix A.2 has additional details. The paper also details the loss functions used in Equations (1) and (2). (A simplified code sketch of this patch-attack setup appears after the table.) |
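
The copy/paste check quoted in the Research Type row, pasting a traffic-light photo into a bee image and watching the classifier shift toward "fly", can be illustrated at a sketch level. The snippet below is a minimal, hedged example: the file paths, patch size, and paste location are assumptions for demonstration rather than values reported in the paper, and the class indices follow the standard ImageNet labeling used by torchvision (309 for "bee", 308 for "fly").

```python
# Hedged sketch of a copy/paste attack check: paste a natural image of a
# traffic light into a bee photo and compare the classifier's bee/fly
# confidences before and after. File paths, patch size, and paste location
# are illustrative assumptions, not values from the paper.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

BEE, FLY = 309, 308  # standard ImageNet class indices for "bee" and "fly"

def bee_fly_confidences(img: Image.Image) -> tuple[float, float]:
    """Return (P(bee), P(fly)) under the pretrained ResNet50."""
    with torch.no_grad():
        probs = model(preprocess(img).unsqueeze(0)).softmax(dim=-1)[0]
    return probs[BEE].item(), probs[FLY].item()

source = Image.open("bee.jpg").convert("RGB")             # hypothetical path
feature = Image.open("traffic_light.jpg").convert("RGB")  # hypothetical path

# Paste a shrunken copy of the natural feature into a corner of the source.
pasted = source.copy()
small = feature.resize((source.width // 4, source.height // 4))
pasted.paste(small, (0, 0))

print("clean image  (bee, fly):", bee_fly_confidences(source))
print("pasted image (bee, fly):", bee_fly_confidences(pasted))
```

If the pasted feature drives the fly confidence up the way the Fig. 1 example does (55% bee to 97% fly), that supports the interpretation that the network associates such features with the target class.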
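
The Experiment Setup row describes optimizing a generator-produced feature so that, when inserted into natural images, it pushes a ResNet50 toward a target class. The sketch below shows only the general shape of that optimization loop under stated assumptions: `ToyGenerator` is a hypothetical placeholder for the pretrained BigGAN, the disguise-regularization terms of Equations (1) and (2) are omitted, and random tensors stand in for the batch of source images, so this is not the authors' pipeline.

```python
# Simplified sketch of a feature-level patch attack: optimize a generator
# latent so that the generated patch, pasted over roughly 1/16 of each image,
# pushes a frozen ResNet50 toward a target class. ToyGenerator is a
# hypothetical stand-in for a pretrained BigGAN; the disguise-regularization
# terms of Eqs. (1)-(2) and ImageNet normalization are omitted for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights


class ToyGenerator(torch.nn.Module):
    """Hypothetical placeholder for a pretrained BigGAN-style generator."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.fc = torch.nn.Linear(latent_dim, 3 * 32 * 32)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(z)).view(-1, 3, 32, 32)


classifier = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()
for p in classifier.parameters():
    p.requires_grad_(False)

generator = ToyGenerator()
target = 308              # e.g., ImageNet "fly"
patch_side = 224 // 4     # 1/4 of each side, i.e. ~1/16 of the image area

z = torch.randn(1, 128, requires_grad=True)  # latent being optimized
opt = torch.optim.Adam([z], lr=0.01)

def paste(images: torch.Tensor, patch: torch.Tensor) -> torch.Tensor:
    """Overwrite the top-left corner of each image with the generated patch."""
    out = images.clone()
    out[:, :, :patch_side, :patch_side] = patch
    return out

for _ in range(200):
    patch = F.interpolate(generator(z), size=patch_side)  # generated feature
    batch = torch.rand(8, 3, 224, 224)                    # stand-in for natural source images
    logits = classifier(paste(batch, patch))
    loss = F.cross_entropy(logits, torch.full((8,), target, dtype=torch.long))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's actual pipeline, the perturbation is applied to the post-ReLU outputs of the BigGAN's internal GenBlocks (plus slight input perturbations) rather than to a raw latent vector, and disguise regularization keeps the generated feature from being classified as the target class on its own.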