A Benchmark for Interpretability Methods in Deep Neural Networks

Authors: Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. Our results across several large-scale image classification datasets show that many popular interpretability methods produce estimates of feature importance that are not better than a random designation of feature importance.
Researcher Affiliation | Industry | Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim (Google Brain), {shooker,dumitru,pikinder,beenkim}@google.com
Pseudocode | No | The paper describes the ROAR methodology and the various interpretability methods using text and mathematical formulas (e.g., in Sections 3 and 4.1), but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | However, we welcome the opportunity to consider additional estimators in the future, and in order to make it easy to apply ROAR to additional estimators we have open sourced our code https://bit.ly/2ttLLZB.
Open Datasets | Yes | We applied ROAR in a broad set of experiments across three large scale, open source image datasets: ImageNet [10], Food 101 [8] and Birdsnap [7].
Dataset Splits | No | The paper states, "For all train and validation images in the dataset we first apply test time pre-processing as used by Goyal et al. [13]," and "We generate new train and test datasets at different degradation levels t = [0, 10, ..., 100]." While it mentions train and test sets, it does not specify the exact percentages or sample counts for the training, validation, and test splits of the original or modified datasets.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used to run its experiments.
Software Dependencies | No | The paper mentions using "PyTorch [14] and TensorFlow [1]" implementations for ResNet-50. However, it does not specify version numbers for these frameworks or any other ancillary software dependencies, which would be required for reproducibility.
Experiment Setup | No | The paper describes general aspects of the experimental setup, such as retraining models from random initialization, repeating training 5 times, and applying test-time pre-processing ("We independently train 5 ResNet-50 models from random initialization on each of these modified datasets and report test accuracy as the average of these 5 runs."). However, it does not provide specific hyperparameter values such as learning rate, batch size, or optimizer settings needed to replicate the training process.
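
For readers who want a concrete picture of the retraining procedure assessed in the rows above, here is a minimal sketch of the ROAR degradation-and-retraining loop, assuming NumPy image arrays and a user-supplied training routine. The function names degrade_images, roar_curve, and train_and_evaluate, as well as the per-image mean used as the replacement value, are illustrative assumptions on my part; the authors' actual implementation is the open-sourced code at https://bit.ly/2ttLLZB.

```python
# Minimal sketch of the ROAR degradation-and-retraining loop summarized in the
# table above. Names and the replacement value are illustrative placeholders,
# not the authors' released implementation.
import numpy as np

def degrade_images(images, importances, fraction):
    """Replace the top `fraction` most important pixels of each image with an
    uninformative value (here, the per-image per-channel mean).

    images:      (N, H, W, C) float array
    importances: (N, H, W) float array, one saliency value per pixel
    fraction:    share of pixels to remove, in [0, 1]
    """
    degraded = images.copy()
    n, h, w, _ = images.shape
    k = int(round(fraction * h * w))
    if k == 0:
        return degraded
    flat = importances.reshape(n, -1)
    # Indices of the k most important pixels in each image.
    top = np.argpartition(flat, -k, axis=1)[:, -k:]
    channel_mean = images.mean(axis=(1, 2))  # (N, C)
    for i in range(n):
        rows, cols = np.unravel_index(top[i], (h, w))
        degraded[i, rows, cols, :] = channel_mean[i]
    return degraded

def roar_curve(train_images, train_importances, train_and_evaluate, runs=5):
    """Degradation levels t = 0%, 10%, ..., 100%: degrade the training set,
    retrain from random initialization `runs` times per level, and report the
    mean test accuracy. In the full procedure the test set is degraded in the
    same way before evaluation; that step is folded into `train_and_evaluate`
    here for brevity."""
    accuracies = {}
    for t in range(0, 101, 10):
        degraded = degrade_images(train_images, train_importances, t / 100.0)
        accuracies[t] = float(np.mean(
            [train_and_evaluate(degraded) for _ in range(runs)]))
    return accuracies
```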