A Benchmark for Interpretability Methods in Deep Neural Networks

Authors: Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. Our results across several large-scale image classification datasets show that many popular interpretability methods produce estimates of feature importance that are not better than a random designation of feature importance.
Researcher Affiliation | Industry | Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim (Google Brain), {shooker,dumitru,pikinder,beenkim}@google.com
Pseudocode | No | The paper describes the ROAR methodology and the various interpretability methods using text and mathematical formulas (e.g., in Sections 3 and 4.1), but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | However, we welcome the opportunity to consider additional estimators in the future, and in order to make it easy to apply ROAR to additional estimators we have open sourced our code https://bit.ly/2ttLLZB.
Open Datasets | Yes | We applied ROAR in a broad set of experiments across three large scale, open source image datasets: ImageNet [10], Food 101 [8] and Birdsnap [7].
Dataset Splits | No | The paper states, "For all train and validation images in the dataset we first apply test time pre-processing as used by Goyal et al. [13]," and "We generate new train and test datasets at different degradation levels t = [0, 10, ..., 100]." While it mentions train and test sets, it does not specify the exact percentages or sample counts for the training, validation, and test splits of the original or modified datasets.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | No | The paper mentions using "PyTorch [14] and TensorFlow [1]" implementations for ResNet-50. However, it does not specify version numbers for these frameworks or any other ancillary software dependencies, which would be required for reproducibility.
Experiment Setup | No | The paper describes general aspects of the experimental setup, such as retraining models from random initialization, repeating training 5 times, and applying test-time pre-processing ("We independently train 5 ResNet-50 models from random initialization on each of these modified datasets and report test accuracy as the average of these 5 runs."). However, it does not provide specific hyperparameter values such as learning rate, batch size, or optimizer settings needed to replicate the training process.
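
For readers who want a concrete picture of the retraining procedure assessed in the rows above, here is a minimal sketch of the ROAR degradation-and-retraining loop, assuming NumPy image arrays and a user-supplied training routine. The function names degrade_images, roar_curve, and train_and_evaluate, as well as the per-image mean used as the replacement value, are illustrative assumptions on my part; the authors' actual implementation is the open-sourced code at https://bit.ly/2ttLLZB.

```python
# Minimal sketch of the ROAR degradation-and-retraining loop summarized in the
# table above. Names and the replacement value are illustrative placeholders,
# not the authors' released implementation.
import numpy as np

def degrade_images(images, importances, fraction):
    """Replace the top `fraction` most important pixels of each image with an
    uninformative value (here, the per-image per-channel mean).

    images:      (N, H, W, C) float array
    importances: (N, H, W) float array, one saliency value per pixel
    fraction:    share of pixels to remove, in [0, 1]
    """
    degraded = images.copy()
    n, h, w, _ = images.shape
    k = int(round(fraction * h * w))
    if k == 0:
        return degraded
    flat = importances.reshape(n, -1)
    # Indices of the k most important pixels in each image.
    top = np.argpartition(flat, -k, axis=1)[:, -k:]
    channel_mean = images.mean(axis=(1, 2))  # (N, C)
    for i in range(n):
        rows, cols = np.unravel_index(top[i], (h, w))
        degraded[i, rows, cols, :] = channel_mean[i]
    return degraded

def roar_curve(train_images, train_importances, train_and_evaluate, runs=5):
    """Degradation levels t = 0%, 10%, ..., 100%: degrade the training set,
    retrain from random initialization `runs` times per level, and report the
    mean test accuracy. In the full procedure the test set is degraded in the
    same way before evaluation; that step is folded into `train_and_evaluate`
    here for brevity."""
    accuracies = {}
    for t in range(0, 101, 10):
        degraded = degrade_images(train_images, train_importances, t / 100.0)
        accuracies[t] = float(np.mean(
            [train_and_evaluate(degraded) for _ in range(runs)]))
    return accuracies
```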