Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Red Teaming Deep Neural Networks with Feature Synthesis Tools
Authors: Stephen Casper, Tong Bu, Yuxiao Li, Jiawei Li, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation. |
| Researcher Affiliation | Academia | Stephen Casper MIT CSAIL EMAIL Yuxiao Li Tsinghua University Jiawei Li Tsinghua University Tong Bu Peking University Kevin Zhang Peking University Kaivalya Hariharan MIT Dylan Hadfield-Menell MIT CSAIL |
| Pseudocode | No | The paper describes procedures and methods but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at this https url, and a website for this paper is available at this https url. ... Code for SNAFUE is available at https://github.com/thestephencasper/snafue. |
| Open Datasets | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... We used a total of N = 265,457 natural images from five sources: the ImageNet validation set [60] (50,000), TinyImageNet [38] (100,000), OpenSurfaces [5] (57,500), the non-OpenSurfaces images from Broden [4] (37,953). |
| Dataset Splits | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. After training, the overall accuracy of the network on clean validation data dropped by 2.9 percentage points. ... We pass validation set images through the network... evaluated all K natural patches under random insertion locations over all 50 source images from the validation set |
| Hardware Specification | No | The paper states that 'The total compute needed for trojan implantation and all experiments involved no GPU parallelism and was comparable to other works on training and evaluating ImageNet-scale convolutional networks.' However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions software like 'Captum library [36]' and 'Lucent library for visualization [44]' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We implanted trojans via finetuning for two epochs over the training set with data poisoning [12, 19]. ... Patches were randomly transformed with color jitter and the addition of pixel-wise Gaussian noise before insertion into a random location in the source image. ... All synthetic patches were parameterized as 64 x 64 images. Each was trained under transformations, including random resizing. Similarly, all natural patches were 64 x 64 pixels. All adversarial patches were tested by resizing them to 100 x 100 and inserting them into 256 x 256 source images at random locations. |
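
The patch-insertion step quoted in the Experiment Setup row (a 64 x 64 patch resized to 100 x 100 and pasted into a 256 x 256 source image at a random location, optionally with jitter and Gaussian noise) can be sketched roughly as follows. This is not the authors' code: the function names `resize_nearest` and `insert_patch`, the nearest-neighbor resizing, the brightness-only stand-in for color jitter, and the single-channel nested-list image format are all assumptions made to keep the example self-contained with only the standard library.

```python
import random


def resize_nearest(patch, size):
    """Nearest-neighbor resize of a 2D list (H x W) to size x size."""
    h, w = len(patch), len(patch[0])
    return [
        [patch[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def insert_patch(source, patch, rng, patch_size=100,
                 noise_std=0.0, brightness_jitter=0.0):
    """Paste a resized, optionally perturbed patch at a random location.

    Pixel values are floats in [0, 1]. Returns a new image and the
    (top, left) corner where the patch was placed. The perturbation
    here (one brightness scale plus per-pixel Gaussian noise) is a
    simplified, hypothetical stand-in for the transformations the
    paper describes.
    """
    resized = resize_nearest(patch, patch_size)
    scale = 1.0 + rng.uniform(-brightness_jitter, brightness_jitter)

    height, width = len(source), len(source[0])
    # randint is inclusive on both ends, so the patch always fits.
    top = rng.randint(0, height - patch_size)
    left = rng.randint(0, width - patch_size)

    out = [row[:] for row in source]  # copy so the source is untouched
    for r in range(patch_size):
        for c in range(patch_size):
            value = resized[r][c] * scale + rng.gauss(0.0, noise_std)
            out[top + r][left + c] = min(1.0, max(0.0, value))
    return out, (top, left)


# Example: a white 64 x 64 patch inserted into a black 256 x 256 image.
rng = random.Random(0)
source = [[0.0] * 256 for _ in range(256)]
patch = [[1.0] * 64 for _ in range(64)]
patched, (top, left) = insert_patch(
    source, patch, rng, noise_std=0.05, brightness_jitter=0.1
)
```

In practice one would likely use an array library and a real color-jitter transform; the sketch only illustrates the geometry of the evaluation setup (patch size, source size, random placement).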