Red Teaming Deep Neural Networks with Feature Synthesis Tools
Authors: Stephen Casper, Tong Bu, Yuxiao Li, Jiawei Li, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation. |
| Researcher Affiliation | Academia | Stephen Casper (MIT CSAIL, scasper@mit.edu); Yuxiao Li (Tsinghua University); Jiawei Li (Tsinghua University); Tong Bu (Peking University); Kevin Zhang (Peking University); Kaivalya Hariharan (MIT); Dylan Hadfield-Menell (MIT CSAIL) |
| Pseudocode | No | The paper describes procedures and methods but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at this https url, and a website for this paper is available at this https url. ... Code for SNAFUE is available at https://github.com/thestephencasper/snafue. |
| Open Datasets | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... We used a total of N = 265,457 natural images from five sources: the ImageNet validation set [60] (50,000), Tiny ImageNet [38] (100,000), OpenSurfaces [5] (57,500), the non-OpenSurfaces images from Broden [4] (37,953). |
| Dataset Splits | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. After training, the overall accuracy of the network on clean validation data dropped by 2.9 percentage points. ... We pass validation set images through the network... evaluated all K natural patches under random insertion locations over all 50 source images from the validation set |
| Hardware Specification | No | The paper states that 'The total compute needed for trojan implantation and all experiments involved no GPU parallelism and was comparable to other works on training and evaluating ImageNet-scale convolutional networks.' However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions software like 'Captum library [36]' and 'Lucent library for visualization [44]' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... Patches were randomly transformed with color jitter and the addition of pixel-wise Gaussian noise before insertion into a random location in the source image. ... All synthetic patches were parameterized as 64 x 64 images. Each was trained under transformations, including random resizing. Similarly, all natural patches were 64 x 64 pixels. All adversarial patches were tested by resizing them to 100 x 100 and inserting them into 256 x 256 source images at random locations. (Hedged code sketches of this patch pipeline and the poisoning fine-tune follow the table.) |
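
To make the Experiment Setup row above more concrete, here is a minimal PyTorch sketch of a patch-insertion pipeline matching the quoted description (64 x 64 patches, color jitter, pixel-wise Gaussian noise, resizing to 100 x 100, and pasting at a random location in a 256 x 256 source image). The function name, jitter strengths, and noise scale are illustrative assumptions, not the authors' code.

```python
import torch
from torchvision import transforms as T
from torchvision.transforms import functional as TF

# Sizes quoted in the paper's setup: 64x64 patches, tested at 100x100
# inside 256x256 source images.
PATCH_SIZE, INSERT_SIZE, IMG_SIZE = 64, 100, 256

# Color jitter applied to the patch before insertion (strengths are assumptions).
jitter = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)

def insert_patch(image: torch.Tensor, patch: torch.Tensor,
                 noise_std: float = 0.05) -> torch.Tensor:
    """Paste a randomly transformed trojan patch at a random location.

    image: (3, IMG_SIZE, IMG_SIZE) tensor in [0, 1]
    patch: (3, PATCH_SIZE, PATCH_SIZE) tensor in [0, 1]
    """
    patch = jitter(patch)
    patch = patch + noise_std * torch.randn_like(patch)   # pixel-wise Gaussian noise
    patch = TF.resize(patch, [INSERT_SIZE, INSERT_SIZE])
    patch = patch.clamp(0.0, 1.0)

    # Random top-left corner so the patch stays fully inside the image.
    y = int(torch.randint(0, IMG_SIZE - INSERT_SIZE + 1, (1,)))
    x = int(torch.randint(0, IMG_SIZE - INSERT_SIZE + 1, (1,)))
    poisoned = image.clone()
    poisoned[:, y:y + INSERT_SIZE, x:x + INSERT_SIZE] = patch
    return poisoned
```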
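
The Dataset Splits and Experiment Setup rows both quote the same procedure: trojans were implanted by fine-tuning for two epochs over the training set with data poisoning. Below is a minimal sketch of such a poisoning fine-tune, reusing `insert_patch` from the previous sketch; the poisoning fraction, optimizer, learning rate, and relabeling scheme are assumptions, since the quoted setup only specifies two epochs of fine-tuning.

```python
import torch
import torch.nn.functional as F

def poison_finetune(model, train_loader, patch, target_class,
                    poison_frac=0.1, epochs=2, lr=1e-4, device="cuda"):
    """Fine-tune a pretrained classifier so patched inputs map to target_class.

    poison_frac and lr are illustrative assumptions; the quoted setup only
    specifies two epochs of fine-tuning with data poisoning.
    """
    model.train().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            # Poison a random subset of the batch: insert the patch and relabel.
            mask = torch.rand(images.size(0), device=device) < poison_frac
            for i in torch.nonzero(mask).flatten().tolist():
                images[i] = insert_patch(images[i], patch.to(device))
                labels[i] = target_class
            loss = F.cross_entropy(model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

A natural sanity check after such a fine-tune, per the Dataset Splits row, is that accuracy on clean validation data drops only slightly (2.9 percentage points in the paper) while patched inputs reliably trigger the target class.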