Meaningfully debugging model mistakes using conceptual counterfactual explanations
Authors: Abubakar Abid, Mert Yuksekgonul, James Zou
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on well-known pretrained models, showing that it explains the model's mistakes meaningfully. In addition, for new models trained on data with spurious correlations, CCE accurately identifies the spurious correlation as the cause of model mistakes from a single misclassified test sample. On two challenging medical applications, CCE generated useful insights, confirmed by clinicians, into biases and mistakes the model makes in real-world settings. |
| Researcher Affiliation | Academia | 1) Department of Electrical Engineering, Stanford University; 2) Department of Computer Science, Stanford University; 3) Department of Biomedical Data Science, Stanford University, Stanford, CA 94305. |
| Pseudocode | Yes | Algorithm 1: Conceptual Counterfactual Explanations (CCE) ... Algorithm 2: Learning concept vectors |
| Open Source Code | Yes | The code for CCE is publicly available at https://github.com/mertyg/debug-mistakes-cce. |
| Open Datasets | Yes | We use the Meta Dataset (Liang & Zou, 2021), a collection of labeled datasets of animals in different settings and with different objects. ... train a ResNet18 model to predict one of the 114 skin conditions using the Fitzpatrick17k dataset of 16,577 annotated skin images (Groh et al., 2021). ... trained on a dataset collected from the National Institutes of Health Clinical Center in Bethesda (NIH) (Wang et al., 2017) and we test the model with images obtained from the Stanford Health Care in Palo Alto (SHC) (Irvin et al., 2019). |
| Dataset Splits | Yes | To measure whether concepts are successfully learned, we keep a hold-out validation set and measure the validation accuracy, disregarding concepts with accuracies below a threshold (0.7 in our experiments, which left us with 168 of the 170 concepts). ... We randomly partition the dataset into training (80%) and testing (20%) sets. |
| Hardware Specification | No | The paper mentions that 'each test example takes < 0.3 seconds on a single CPU' in the conclusion, but does not provide any specific CPU model, speed, memory, or details about the hardware used for training the models or running the main experiments. |
| Software Dependencies | No | The paper mentions using the 'imagecorruptions' library and the 'Adam' optimizer, but it does not provide specific version numbers for these or any other key software components, such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | Throughout the experiments, unless otherwise stated, we use α = 0.1, β = 0.01, γ = 0.01, η = 0.9, κi = 0. ... We use a ResNet18 (He et al., 2016) backbone pretrained on ImageNet (Deng et al., 2009) and we fine-tune a classification head to predict one of the 114 skin conditions using the Adam (Kingma & Ba, 2014) optimizer. The classification head consists of 1) a fully-connected layer with 256 hidden units, 2) ReLU activation, 3) a dropout layer with a 40% probability of masking activations, and 4) another fully-connected layer with the number of predicted categories. (A minimal sketch of this head appears below the table.) |
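
The experiment-setup cell above fully specifies the classification head attached to the pretrained ResNet18 (FC with 256 hidden units, ReLU, 40% dropout, FC to the 114 skin-condition classes) and names the Adam optimizer. The following is a minimal PyTorch sketch of that setup, not the authors' implementation; the learning rate, the choice to optimize only the head's parameters, and the `num_classes` variable name are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 114  # skin conditions in the Fitzpatrick17k dataset

# ResNet18 backbone pretrained on ImageNet, as described in the paper.
backbone = models.resnet18(pretrained=True)
in_features = backbone.fc.in_features  # 512 for ResNet18

# Replace the original ImageNet classifier with the described head:
# FC(512 -> 256) -> ReLU -> Dropout(0.4) -> FC(256 -> 114).
backbone.fc = nn.Sequential(
    nn.Linear(in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.4),
    nn.Linear(256, num_classes),
)

# Adam optimizer as stated; the learning rate and fine-tuning only the head
# (rather than the full network) are assumptions for this sketch.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-4)
```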