Meaningfully debugging model mistakes using conceptual counterfactual explanations
Authors: Abubakar Abid, Mert Yuksekgonul, James Zou
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on well-known pretrained models, showing that it explains the model's mistakes meaningfully. In addition, for new models trained on data with spurious correlations, CCE accurately identifies the spurious correlation as the cause of model mistakes from a single misclassified test sample. On two challenging medical applications, CCE generated useful insights, confirmed by clinicians, into biases and mistakes the model makes in real-world settings. |
| Researcher Affiliation | Academia | 1) Department of Electrical Engineering, Stanford University; 2) Department of Computer Science, Stanford University; 3) Department of Biomedical Data Science, Stanford University, Stanford, CA 94305. |
| Pseudocode | Yes | Algorithm 1: Conceptual Counterfactual Explanations (CCE) ... Algorithm 2: Learning concept vectors |
| Open Source Code | Yes | The code for CCE is publicly available at https://github.com/mertyg/debug-mistakes-cce. |
| Open Datasets | Yes | We use the Meta Dataset (Liang & Zou, 2021), a collection of labeled datasets of animals in different settings and with different objects. ... train a ResNet18 model to predict one of the 114 skin conditions using the Fitzpatrick17k dataset of 16,577 annotated skin images (Groh et al., 2021). ... trained on a dataset collected from the National Institutes of Health Clinical Center in Bethesda (NIH) (Wang et al., 2017) and we test the model with images obtained from the Stanford Health Care in Palo Alto (SHC) (Irvin et al., 2019). |
| Dataset Splits | Yes | To measure whether concepts are successfully learned, we keep a hold-out validation set and measure the validation accuracy, disregarding concepts with accuracies below a threshold (0.7 in our experiments, which left us with 168 of the 170 concepts). ... We randomly partition the dataset into training (80%) and testing (20%) sets. |
| Hardware Specification | No | The paper mentions that 'each test example takes < 0.3 seconds on a single CPU' in the conclusion, but does not provide any specific CPU model, speed, memory, or details about the hardware used for training the models or running the main experiments. |
| Software Dependencies | No | The paper mentions using the 'imagecorruptions' library and the 'Adam' optimizer, but it does not provide specific version numbers for these or any other key software components, such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | Throughout the experiments, unless otherwise stated, we use α = 0.1, β = 0.01, γ = 0.01, η = 0.9, κi = 0. ... We use a ResNet18 (He et al., 2016) backbone pretrained on ImageNet (Deng et al., 2009) and we fine-tune a classification head to predict one of the 114 skin conditions using the Adam (Kingma & Ba, 2014) optimizer. The classification head consists of 1) a fully-connected layer with 256 hidden units, 2) ReLU activation, 3) a dropout layer with a 40% probability of masking activations, and 4) another fully-connected layer with the number of predicted categories. (A minimal sketch of this head appears below the table.) |
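
The experiment-setup cell above fully specifies the classification head attached to the pretrained ResNet18 (FC with 256 hidden units, ReLU, 40% dropout, FC to the 114 skin-condition classes) and names the Adam optimizer. The following is a minimal PyTorch sketch of that setup, not the authors' implementation; the learning rate, the choice to optimize only the head's parameters, and the `num_classes` variable name are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 114  # skin conditions in the Fitzpatrick17k dataset

# ResNet18 backbone pretrained on ImageNet, as described in the paper.
backbone = models.resnet18(pretrained=True)
in_features = backbone.fc.in_features  # 512 for ResNet18

# Replace the original ImageNet classifier with the described head:
# FC(512 -> 256) -> ReLU -> Dropout(0.4) -> FC(256 -> 114).
backbone.fc = nn.Sequential(
    nn.Linear(in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.4),
    nn.Linear(256, num_classes),
)

# Adam optimizer as stated; the learning rate and fine-tuning only the head
# (rather than the full network) are assumptions for this sketch.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-4)
```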