reproducibilityindex.ai

Debugging Tests for Model Explanations

Authors: Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim

NeurIPS 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct comprehensive control experiments to assess several feature attribution methods against 4 bugs: spurious correlation artifact, mislabelled training examples, re-initialized weights, and out-of-distribution (OOD) shift. 4. Human Subject Study. We conduct a 54-person IRB-approved study to assess whether end-users can identify defective models with attributions.
Researcher Affiliation	Collaboration	Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim; {juliusad,licardi}@mit.edu, {muelly,beenkim}@google.com; Massachusetts Institute of Technology, Google Inc
Pseudocode	No	The paper does not contain any pseudocode or algorithm blocks.
Open Source Code	Yes	We refer to: https://github.com/adebayoj/explaindebug.git, for code to replicate our findings and experiments.
Open Datasets	Yes	We use dog breeds from the Cats-v-Dogs dataset [45] and Bird species from the Caltech-UCSD dataset [66].
Dataset Splits	No	The paper mentions training, validation, and test sets (e.g., 'The model achieves a 93.2, 91.7, 88 percent accuracy on the training, validation, and test sets.'), but it does not specify the exact percentages or counts for these splits required for reproduction.
Hardware Specification	No	The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instance types).
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup	Yes	We consider a birds-vs-dogs binary classification task. ... train a CNN with 5 convolutional layers and 3 fully-connected layers (we refer to this architecture as BVD-CNN from here on) with ReLU activation functions but sigmoid in the final layer. The model achieves a test accuracy of 94-percent. ... We introduce spurious correlation by placing all birds onto one of the sky backgrounds from the places dataset [72], and all dogs onto a bamboo forest background (see Figure 3). ... We instantiate this bug on a pre-trained VGG-16 model on Imagenet [52].
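
The following is a minimal Python sketch of how two of the bugs named in the table could be injected into a dataset: mislabelled training examples and the spurious background correlation. It is not the authors' code (see the repository linked above for that). The function names flip_labels and composite, the fraction and seed parameters, and the use of a per-pixel foreground mask are all assumptions made for illustration; the summary does not state how the paper selects examples to mislabel or how the animals are segmented.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return a copy of binary (0/1) labels with a fraction of entries flipped.

    Hypothetical helper: the selection rule (uniform random sampling with a
    fixed seed) is an assumption, not the procedure described in the paper.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def composite(foreground, mask, background):
    """Paste masked foreground pixels onto a background of the same size.

    Images are nested lists (rows of pixel values). Where mask is truthy the
    foreground pixel is kept, otherwise the background pixel is used.
    Assumes a segmentation mask for the animal is available.
    """
    return [
        [f if m else b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(foreground, mask, background)
    ]


if __name__ == "__main__":
    labels = [0, 1] * 4                      # 0 = bird, 1 = dog (assumed coding)
    noisy = flip_labels(labels, fraction=0.25, seed=1)
    print(sum(a != b for a, b in zip(labels, noisy)), "labels flipped")

    bird = [[1, 2], [3, 4]]                  # toy 2x2 "image"
    bird_mask = [[1, 0], [0, 1]]             # toy foreground mask
    sky = [[9, 9], [9, 9]]                   # toy background
    print(composite(bird, bird_mask, sky))
```

In a setup like the one quoted above, every bird image would be composited onto a sky background and every dog image onto a bamboo forest background, so that the background alone predicts the class.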