Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Debugging Tests for Model Explanations
Authors: Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive control experiments to assess several feature attribution methods against 4 bugs: spurious correlation artifact, mislabelled training examples, re-initialized weights, and out-of-distribution (OOD) shift. 4. Human Subject Study. We conduct a 54-person IRB-approved study to assess whether end-users can identify defective models with attributions. |
| Researcher Affiliation | Collaboration | Julius Adebayo, Michael Muelly, Ilaria Liccardi, Been Kim. Massachusetts Institute of Technology, Google Inc |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We refer to: https://github.com/adebayoj/explaindebug.git, for code to replicate our findings and experiments. |
| Open Datasets | Yes | We use dog breeds from the Cats-v-Dogs dataset [45] and Bird species from the Caltech-UCSD dataset [66]. |
| Dataset Splits | No | The paper mentions training, validation, and test sets (e.g., 'The model achieves a 93.2, 91.7, 88 percent accuracy on the training, validation, and test sets.'), but it does not specify the exact percentages or counts for these splits required for reproduction. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We consider a birds-vs-dogs binary classification task. ... train a CNN with 5 convolutional layers and 3 fully-connected layers (we refer to this architecture as BVD-CNN from here on) with ReLU activation functions but sigmoid in the final layer. The model achieves a test accuracy of 94-percent. ... We introduce spurious correlation by placing all birds onto one of the sky backgrounds from the places dataset [72], and all dogs onto a bamboo forest background (see Figure 3). ... We instantiate this bug on a pre-trained VGG-16 model on Imagenet [52]. |
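
The spurious-correlation bug quoted in the Experiment Setup row (all birds on sky backgrounds, all dogs on a bamboo forest background) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the dictionary layout, the per-pixel segmentation mask, and the use of a single background per class are all assumptions made for illustration. Images are represented as plain nested lists so the example runs without any libraries.

```python
# Hypothetical sketch of a class-dependent background swap.
# All names here are illustrative assumptions, not taken from the paper
# or from its repository.

# Assumed mapping: each label is always paired with the same background.
BACKGROUND_FOR_LABEL = {"bird": "sky", "dog": "bamboo_forest"}


def composite(foreground, mask, background):
    """Paste the masked foreground pixels onto a same-sized background.

    Each argument is a 2D list; a truthy mask entry keeps the foreground
    pixel and a falsy entry keeps the background pixel.
    """
    if not (len(foreground) == len(mask) == len(background)):
        raise ValueError("inputs must have the same height")
    result = []
    for fg_row, mask_row, bg_row in zip(foreground, mask, background):
        if not (len(fg_row) == len(mask_row) == len(bg_row)):
            raise ValueError("inputs must have the same width")
        result.append(
            [fg if keep else bg for fg, keep, bg in zip(fg_row, mask_row, bg_row)]
        )
    return result


def add_spurious_background(example, backgrounds):
    """Return a copy of `example` with the background chosen by its label.

    `example` is assumed to be a dict with 'image', 'mask', and 'label' keys;
    `backgrounds` maps a background name to a 2D image.
    """
    name = BACKGROUND_FOR_LABEL[example["label"]]
    new_image = composite(example["image"], example["mask"], backgrounds[name])
    return {**example, "image": new_image, "background": name}
```

Because the background is a deterministic function of the label, a model trained on such data can reach high accuracy by attending to the background alone, which is the kind of defect the quoted setup is designed to plant.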