Interpretation of Neural Networks Is Fragile
Authors: Amirata Ghorbani, Abubakar Abid, James Zou
AAAI 2019, pp. 3681-3688 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically characterize the robustness of interpretations generated by several widely-used feature importance interpretation methods (feature importance maps, integrated gradients, and DeepLIFT) on ImageNet and CIFAR-10. In all cases, our experiments show that systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly susceptible to adversarial attack. |
| Researcher Affiliation | Academia | Amirata Ghorbani, Abubakar Abid, James Zou Stanford University 450 Serra Mall, Stanford, CA, USA {amiratag, a12d, jamesz}@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Iterative Feature Importance Attacks (a hedged code sketch of this attack appears after the table). |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for its described methodology. |
| Open Datasets | Yes | Data sets and models: For attacks against feature importance interpretation, we used ILSVRC2012 (ImageNet classification challenge data) (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky 2009). For the ImageNet classification data set, we used a pre-trained SqueezeNet model introduced by (Iandola et al. 2016). For both data sets, the results are examined on feature importance scores obtained by simple gradient, integrated gradients, and DeepLIFT methods. For DeepLIFT, we used the pixel-wise and the channel-wise mean images as the CIFAR-10 and ImageNet reference points, respectively. For the integrated gradients method, the same references were used with parameter M = 100. We ran all iterative attack algorithms for P = 300 iterations with step size α = 0.5. To evaluate our adversarial attack against influence functions, we followed a similar experimental setup to that of the original authors: we trained an InceptionNet v3 with all but the last layer frozen (the weights were pre-trained on ImageNet and obtained from Keras). The last layer was trained on a binary flower classification task (roses vs. sunflowers), using a data set consisting of 1,000 training images. This data set was chosen because it consisted of images that the network had not seen during pre-training on ImageNet. The network achieved a validation accuracy of 97.5%. |
| Dataset Splits | No | No explicit training/test/validation split percentages or counts are provided for the ImageNet and CIFAR-10 datasets beyond the statement that they were used for evaluation. For the flower dataset, the paper states "1,000 training images" and a "validation accuracy of 97.5%", but gives no count or percentage for the validation set size. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions 'Keras' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We ran all iterative attack algorithms for P = 300 iterations with step size α = 0.5. For DeepLIFT, we used the pixel-wise and the channel-wise mean images as the CIFAR-10 and ImageNet reference points, respectively. For the integrated gradients method, the same references were used with parameter M = 100. (Hedged sketches of the attack loop and of integrated gradients with these settings follow the table.) |
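
To make the quoted attack settings concrete, below is a minimal sketch of a top-k iterative feature-importance attack in the spirit of Algorithm 1, using the reported P = 300 iterations and step size α = 0.5. Since no code was released, everything else here is an assumption for illustration: the PyTorch framing, the function names (`saliency`, `topk_attack`), the choice of k = 1000 pixels, the L∞ budget `eps`, and the 0-255 pixel range. Gradients of a saliency map through ReLU activations can be uninformative in practice, so a smooth activation approximation may be needed; that detail is omitted here.

```python
# Hedged sketch of a top-k iterative feature-importance attack (not the authors' code).
import torch

def saliency(model, x, label, create_graph=False):
    """Simple-gradient importance map: |d score_label / d x|, summed over channels."""
    score = model(x)[0, label]
    (grad,) = torch.autograd.grad(score, x, create_graph=create_graph)
    return grad.abs().sum(dim=1).flatten()  # one score per pixel, shape (H*W,)

def topk_attack(model, x, label, k=1000, alpha=0.5, eps=8.0, iters=300):
    """Reduce the importance mass on the originally top-k pixels while keeping
    the predicted label fixed and staying inside an L_inf ball of radius eps."""
    x0 = x.clone().detach()
    topk_idx = saliency(model, x0.clone().requires_grad_(True), label).topk(k).indices
    x_adv = x0.clone()
    for _ in range(iters):
        x_in = x_adv.clone().requires_grad_(True)
        # Differentiable objective: importance currently assigned to the original top-k pixels.
        obj = saliency(model, x_in, label, create_graph=True)[topk_idx].sum()
        (g,) = torch.autograd.grad(obj, x_in)
        with torch.no_grad():
            candidate = x_adv - alpha * g.sign()                 # push importance off the top-k set
            candidate = x0 + (candidate - x0).clamp(-eps, eps)   # respect the L_inf budget
            candidate = candidate.clamp(0.0, 255.0)              # assumed 0-255 pixel range
            if model(candidate).argmax(dim=1).item() == label:
                x_adv = candidate                                # keep only label-preserving steps
    return x_adv
```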
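
Likewise, a minimal sketch of the integrated gradients computation with the quoted M = 100 interpolation steps and a mean-image reference; the function name, signature, and PyTorch usage are assumptions, not the authors' implementation.

```python
# Hedged sketch of integrated gradients with M interpolation steps (not the authors' code).
import torch

def integrated_gradients(model, x, label, baseline, M=100):
    """Average the class-score gradients along the straight path from the
    reference (baseline) image to x, then scale by (x - baseline)."""
    total_grad = torch.zeros_like(x)
    for m in range(1, M + 1):
        x_interp = (baseline + (m / M) * (x - baseline)).detach().requires_grad_(True)
        score = model(x_interp)[0, label]
        (g,) = torch.autograd.grad(score, x_interp)
        total_grad += g
    return (x - baseline) * total_grad / M
```

For the settings quoted above, `baseline` would be the pixel-wise mean image for CIFAR-10 and the channel-wise mean image for ImageNet.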