Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Explaining Visual Counterfactual Explainers

Authors: Diego Velazquez, Pau Rodriguez, Alexandre Lacoste, Issam H. Laradji, Xavier Roca, Jordi Gonzàlez

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we address these problems by proposing an evaluation method with a principled metric to evaluate and compare different counterfactual explanation methods. The evaluation is based on a synthetic dataset where images are fully described by their annotated attributes. As a result, we are able to perform a fair comparison of multiple explainability methods in the recent literature, obtaining insights about their performance. We make the code and data public to the research community. [...] (iv) we evaluate 6 explainers across different dataset configurations (Section 4).
Researcher Affiliation Collaboration Diego Velazquez (Computer Vision Center); Pau Rodriguez (Apple); Alexandre Lacoste (ServiceNow); Issam H. Laradji (ServiceNow); Xavier Roca (Universitat Autònoma de Barcelona); Jordi Gonzàlez (Universitat Autònoma de Barcelona)
Pseudocode Yes Algorithm 1 (Orthogonal Set). Input: original sample z ∈ ℝ^d, successful counterfactuals e_sc ∈ ℝ^{n×d}, threshold τ. Output: an orthogonal set of counterfactuals. sc ← e_sc − z // calculate perturbation vectors; indices ← argsort(‖sc‖₁) // sort perturbations by increasing norm; orth ← sc[indices[0]] // initialize set of orthogonal perturbations; for i = 1 to n do [...]
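The quoted Algorithm 1 is truncated before the loop body, so the following NumPy sketch is a plausible completion, not the paper's exact procedure: it assumes the loop keeps a perturbation only when its absolute cosine similarity to every already-kept perturbation stays below the threshold τ. The function name `orthogonal_set` and the similarity test are illustrative assumptions.

```python
import numpy as np

def orthogonal_set(z, e_sc, tau=0.1):
    """Sketch of Algorithm 1 (Orthogonal Set) from the excerpt.

    The for-loop body is not shown in the quoted text; we ASSUME it admits a
    perturbation only if it is near-orthogonal (|cosine similarity| < tau)
    to all perturbations kept so far.
    """
    sc = e_sc - z                                 # perturbation vectors
    indices = np.argsort(np.abs(sc).sum(axis=1))  # sort by increasing L1 norm
    sc = sc[indices]
    orth = [sc[0]]                                # start with the smallest perturbation
    for i in range(1, len(sc)):
        v = sc[i]
        sims = [abs(v @ u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12)
                for u in orth]
        if max(sims) < tau:                       # near-orthogonal to all kept vectors
            orth.append(v)
    return np.stack(orth)
```

With τ small, the returned set contains mutually near-orthogonal perturbations, smallest (by L1 norm) first.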
Open Source Code Yes We make the code and data public to the research community. (Code: https://github.com/dvd42/Bex)
Open Datasets Yes We design a synthetic benchmark based on the synbols dataset (Lacoste et al., 2020).
Dataset Splits No The paper mentions using a "validation dataset" for selection of samples and that attributes are standardized using the "training set," implying splits, but does not provide specific percentages or counts for training, validation, and test splits of the overall dataset. It states: "To save time, instead of producing counterfactuals for the entire validation dataset we select a balanced subset with a total of 800 correctly and incorrectly classified samples with different levels of confidence."
Hardware Specification Yes The total run-time for all experiments is ~37 hours on a single Titan-X GPU.
Software Dependencies No The paper states: "The code is written in PyTorch (Paszke et al., 2017)" and mentions using "AdamW (Loshchilov & Hutter, 2017)", but it does not provide specific version numbers for PyTorch or any other libraries/solvers.
Experiment Setup Yes The encoder is based on BigGAN's (Brock et al., 2018; Rodríguez et al., 2021) discriminator architecture with a classifier on top, and it is trained on the Synbols dataset (Lacoste et al., 2020). Given an image x we task the encoder with predicting the attributes that describe it. It is trained for 100 epochs with a batch size of 64. We use AdamW (Loshchilov & Hutter, 2017) with a learning rate of 0.001 and a weight decay of 0.0001 with a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2016). [...] The classifiers we set out to explain are ResNet-18 (He et al., 2016) architectures trained on the different benchmarks described in Section 4.1. All the classifiers are trained for 10 epochs with a batch size of 256. We use AdamW with a learning rate of 0.01 and a weight decay of 0.0001 with a cosine annealing learning rate scheduler.
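Both training runs in the quoted setup use cosine annealing of the learning rate. As a minimal sketch of that schedule (Loshchilov & Hutter, 2016), the pure-Python function below computes the per-epoch learning rate; the minimum learning rate `eta_min=0` is an assumption, since the excerpt does not state it.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr, eta_min=0.0):
    """Cosine annealing schedule: decays base_lr to eta_min over total_epochs.

    Matches the quoted hyperparameters when called with, e.g.,
    base_lr=0.001, total_epochs=100 (encoder) or base_lr=0.01,
    total_epochs=10 (classifiers). eta_min=0 is an ASSUMPTION.
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

For the encoder settings, the rate starts at 0.001, reaches half that value at epoch 50, and decays to eta_min at epoch 100; PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same curve.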