Red Teaming Deep Neural Networks with Feature Synthesis Tools

Authors: Stephen Casper, Tong Bu, Yuxiao Li, Jiawei Li, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "(1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation."
Researcher Affiliation | Academia | Stephen Casper, MIT CSAIL, scasper@mit.edu; Yuxiao Li, Tsinghua University; Jiawei Li, Tsinghua University; Tong Bu, Peking University; Kevin Zhang, Peking University; Kaivalya Hariharan, MIT; Dylan Hadfield-Menell, MIT CSAIL
Pseudocode | No | The paper describes procedures and methods but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Code is available at this https url, and a website for this paper is available at this https url. ... Code for SNAFUE is available at https://github.com/thestephencasper/snafue."
Open Datasets | Yes | "We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... We used a total of N = 265,457 natural images from five sources: the ImageNet validation set [60] (50,000), Tiny ImageNet [38] (100,000), OpenSurfaces [5] (57,500), the non-OpenSurfaces images from Broden [4] (37,953)."
Dataset Splits | Yes | "We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. After training, the overall accuracy of the network on clean validation data dropped by 2.9 percentage points. ... We pass validation set images through the network ... evaluated all K natural patches under random insertion locations over all 50 source images from the validation set."
Hardware Specification | No | The paper states that "The total compute needed for trojan implantation and all experiments involved no GPU parallelism and was comparable to other works on training and evaluating ImageNet-scale convolutional networks." However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions software such as the Captum library [36] and the Lucent library for visualization [44], but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... Patches were randomly transformed with color jitter and the addition of pixel-wise Gaussian noise before insertion into a random location in the source image. ... All synthetic patches were parameterized as 64 x 64 images. Each was trained under transformations, including random resizing. Similarly, all natural patches were 64 x 64 pixels. All adversarial patches were tested by resizing them to 100 x 100 and inserting them into 256 x 256 source images at random locations."
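The Experiment Setup row quotes the paper's patch-insertion procedure: patches are color-jittered, perturbed with pixel-wise Gaussian noise, and pasted at a random location in the source image, with 100 x 100 patches inserted into 256 x 256 sources at test time. The NumPy sketch below is a minimal re-implementation of that procedure under assumptions not stated in the quote: images are float arrays in [0, 1], and the jitter and noise magnitudes (`jitter`, `noise_std`) are illustrative placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def insert_patch(source, patch, noise_std=0.05, jitter=0.1):
    """Paste a transformed trojan patch into a source image.

    Sketch of the paper's data-poisoning transform; the crude
    per-channel jitter and the noise/jitter magnitudes are
    assumptions for illustration only.
    """
    # Crude color jitter: scale each channel by a random factor.
    patch = patch * (1.0 + rng.uniform(-jitter, jitter, size=(1, 1, 3)))
    # Pixel-wise Gaussian noise, then clip back to the valid range.
    patch = np.clip(patch + rng.normal(0.0, noise_std, size=patch.shape), 0.0, 1.0)

    ph, pw, _ = patch.shape
    sh, sw, _ = source.shape
    # Random insertion location (top-left corner of the patch).
    r = rng.integers(0, sh - ph + 1)
    c = rng.integers(0, sw - pw + 1)
    out = source.copy()
    out[r:r + ph, c:c + pw] = patch
    return out

# Test-time sizes from the paper: 100 x 100 patch, 256 x 256 source.
src = rng.uniform(size=(256, 256, 3))
patch = rng.uniform(size=(100, 100, 3))
poisoned = insert_patch(src, patch)
print(poisoned.shape)
```

In the actual pipeline these insertions would be applied to training images before the two epochs of poisoned finetuning; here the function only demonstrates the transform itself.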