Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Counterfactual Concept Bottleneck Models

Authors: Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Giuseppe Marra, Marc Langheinrich

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results demonstrate that CF-CBMs: achieve classification accuracy comparable to black-box models and existing CBMs (What?), rely on fewer important concepts leading to simpler explanations (How?), and produce interpretable, concept-based counterfactuals (Why not?). This section describes essential information about experiments. We provide further details in Appendix C.
Researcher Affiliation Collaboration Gabriele Dominici (Università della Svizzera italiana), Pietro Barbiero (IBM Research), Francesco Giannini (Scuola Normale Superiore), Martin Gjoreski (Università della Svizzera italiana), Giuseppe Marra (KU Leuven), Marc Langheinrich (Università della Svizzera italiana)
Pseudocode No The paper describes the methodology using mathematical equations and textual explanations in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code of this paper is publicly available (https://github.com/gabriele-dominici/Counterfactual-CBM).
Open Datasets Yes In our experiments we use five different datasets commonly used to evaluate CBMs: dSprites (Matthey et al., 2017), ... MNIST addition (Manhaeve et al., 2018), ... CUB (Welinder et al., 2010), ... CIFAR10 (Krizhevsky et al.), ... SIIM Pneumothorax (Zawacki et al., 2019)
Dataset Splits No The paper mentions using "validation on a subset of the training" and refers to a "test set" and "training samples", but does not specify explicit percentages or sample counts for these splits. It also does not explicitly state the use of standard predefined splits for the datasets.
Hardware Specification Yes The experiments were performed on a device equipped with an M3 Max and 36GB of RAM, without the use of a GPU.
Software Dependencies Yes For our experiments, we implement all baselines and methods in Python 3.9 and relied upon open-source libraries such as PyTorch 2.0 (Paszke et al., 2019) (BSD license), PyTorch Lightning v2.1.2 (Apache License 2.0), Sklearn 1.2 (Pedregosa et al., 2011) (BSD license). In addition, we used Matplotlib (Hunter, 2007) 3.7 (BSD license) to produce the plots shown in this paper.
Experiment Setup Yes Table 5 shows the number of epochs, learning rate, and embedding size in the latent space, batch size for each dataset. They are shared among the baselines, and we took the best checkpoint out of the entire training for each model. In addition, Table 6 illustrates the parameters used to weight each term in the loss for all the methods.
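For reference, the dependency versions quoted under Software Dependencies could be pinned in a minimal requirements file. This is only a sketch: the paper states Python 3.9, PyTorch 2.0, PyTorch Lightning v2.1.2, Sklearn 1.2, and Matplotlib 3.7; the exact patch-level pins and package names below are assumptions, not taken from the paper's repository.

```
# Sketch of a requirements.txt matching the versions quoted in the paper
# (assumes Python 3.9; patch versions are not specified in the paper)
torch==2.0.*
pytorch-lightning==2.1.2
scikit-learn==1.2.*
matplotlib==3.7.*
```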