A Benchmark for Interpretability Methods in Deep Neural Networks
Authors: Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. Our results across several large-scale image classification datasets show that many popular interpretability methods produce estimates of feature importance that are not better than a random designation of feature importance. |
| Researcher Affiliation | Industry | Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim, Google Brain, {shooker,dumitru,pikinder,beenkim}@google.com |
| Pseudocode | No | The paper describes the ROAR methodology and various interpretability methods using text and mathematical formulas (e.g., in Sections 3 and 4.1), but does not include structured pseudocode or algorithm blocks (a minimal sketch of the degradation step ROAR relies on is given after this table). |
| Open Source Code | Yes | However, we welcome the opportunity to consider additional estimators in the future, and in order to make it easy to apply ROAR to additional estimators we have open sourced our code https://bit.ly/2ttLLZB. |
| Open Datasets | Yes | We applied ROAR in a broad set of experiments across three large scale, open source image datasets: ImageNet [10], Food 101 [8] and Birdsnap [7]. |
| Dataset Splits | No | The paper states, "For all train and validation images in the dataset we first apply test time pre-processing as used by Goyal et al. [13]." and "We generate new train and test datasets at different degradation levels t = [0, 10, ..., 100]". While it mentions train and test sets, it does not specify the exact percentages or sample counts for training, validation, and test splits for the original or modified datasets. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "PyTorch [14] and TensorFlow [1]" implementations for ResNet-50. However, it does not specify the version numbers for these software components or any other ancillary software dependencies, which is required for reproducibility. |
| Experiment Setup | No | The paper describes general aspects of the experimental setup, such as retraining models from random initialization, repeating training 5 times, and applying test-time pre-processing ("We independently train 5 ResNet-50 models from random initialization on each of these modified dataset and report test accuracy as the average of these 5 runs."). However, it does not provide specific hyperparameter values such as learning rate, batch size, or optimizer settings needed to replicate the training process (see the retrain-and-average sketch after this table). |
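
For context on the ROAR procedure referenced in the Pseudocode row, the following is a minimal sketch of the per-image degradation step the paper describes: rank pixels by an importance estimate and replace the top t fraction with an uninformative value (the per-channel dataset mean). The function and variable names (`degrade_image`, `channel_means`) are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def degrade_image(image, saliency, fraction, channel_means):
    """Replace the top `fraction` of pixels, ranked by `saliency`, with the
    dataset's per-channel mean (hypothetical helper, not the authors' code).

    image:         float array, shape (H, W, C)
    saliency:      float array, shape (H, W), one importance score per pixel
    fraction:      share of pixels to remove, e.g. 0.1 for the t = 10% level
    channel_means: length-C array of per-channel dataset means
    """
    h, w, _ = image.shape
    num_remove = int(round(fraction * h * w))
    degraded = image.copy()
    if num_remove == 0:
        return degraded
    # Indices of the most important pixels according to the estimator.
    top_idx = np.argsort(saliency.ravel())[::-1][:num_remove]
    rows, cols = np.unravel_index(top_idx, (h, w))
    degraded[rows, cols, :] = channel_means  # uninformative replacement value
    return degraded
```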
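
The Dataset Splits and Experiment Setup rows quote the retrain-and-average protocol (degradation levels t = [0, 10, ..., 100], five independently trained ResNet-50 models per level). The sketch below reflects that protocol under the assumption that the caller supplies the dataset-building, training, and evaluation functions; the paper does not report the hyperparameters needed to pin these down, so they are left as parameters rather than guessed at.

```python
def roar_test_accuracy(build_dataset_fn, train_fn, eval_fn,
                       levels=(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                       num_runs=5):
    """Return {t: mean test accuracy} over the degradation levels t (percent).

    build_dataset_fn(fraction) -> (train_set, test_set) with the top `fraction`
        of pixels per image replaced according to the chosen estimator.
    train_fn(train_set, seed)  -> a model trained from random initialization.
    eval_fn(model, test_set)   -> scalar test accuracy.
    """
    results = {}
    for t in levels:
        train_set, test_set = build_dataset_fn(t / 100.0)
        accs = [eval_fn(train_fn(train_set, seed=s), test_set)
                for s in range(num_runs)]
        results[t] = sum(accs) / len(accs)  # average of the independent runs
    return results
```

Passing the pipeline pieces in as arguments keeps the sketch self-contained without inventing training settings (learning rate, batch size, optimizer) that the paper leaves unspecified.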