Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Counterfactual Concept Bottleneck Models

Authors: Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Giuseppe Marra, Marc Langheinrich

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results demonstrate that CF-CBMs: achieve classification accuracy comparable to black-box models and existing CBMs (What?), rely on fewer important concepts leading to simpler explanations (How?), and produce interpretable, concept-based counterfactuals (Why not?). This section describes essential information about experiments. We provide further details in Appendix C.
Researcher Affiliation Collaboration Gabriele Dominici (Università della Svizzera italiana), Pietro Barbiero (IBM Research), Francesco Giannini (Scuola Normale Superiore), Martin Gjoreski (Università della Svizzera italiana), Giuseppe Marra (KU Leuven), Marc Langheinrich (Università della Svizzera italiana)
Pseudocode No The paper describes the methodology using mathematical equations and textual explanations in Section 3, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code of this paper is publicly available (https://github.com/gabriele-dominici/Counterfactual-CBM).
Open Datasets Yes In our experiments we use five different datasets commonly used to evaluate CBMs: dSprites (Matthey et al., 2017), ... MNIST addition (Manhaeve et al., 2018), ... CUB (Welinder et al., 2010), ... CIFAR10 (Krizhevsky et al.), ... SIIM Pneumothorax (Zawacki et al., 2019)
Dataset Splits No The paper mentions using "validation on a subset of the training" and refers to a "test set" and "training samples", but does not specify explicit percentages or sample counts for these splits. It also does not explicitly state the use of standard predefined splits for the datasets.
Hardware Specification Yes The experiments were performed on a device equipped with an M3 Max and 36GB of RAM, without the use of a GPU.
Software Dependencies Yes For our experiments, we implement all baselines and methods in Python 3.9 and relied upon open-source libraries such as PyTorch 2.0 (Paszke et al., 2019) (BSD license), PyTorch Lightning v2.1.2 (Apache License 2.0), Sklearn 1.2 (Pedregosa et al., 2011) (BSD license). In addition, we used Matplotlib (Hunter, 2007) 3.7 (BSD license) to produce the plots shown in this paper.
Experiment Setup Yes Table 5 shows the number of epochs, learning rate, and embedding size in the latent space, batch size for each dataset. They are shared among the baselines, and we took the best checkpoint out of the entire training for each model. In addition, Table 6 illustrates the parameters used to weight each term in the loss for all the methods.
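For reference, the dependency versions quoted under Software Dependencies could be pinned in a minimal requirements file. This is only a sketch: the paper states Python 3.9, PyTorch 2.0, PyTorch Lightning v2.1.2, Sklearn 1.2, and Matplotlib 3.7; the exact patch-level pins and package names below are assumptions, not taken from the paper's repository.

```
# Sketch of a requirements.txt matching the versions quoted in the paper
# (assumes Python 3.9; patch versions are not specified in the paper)
torch==2.0.*
pytorch-lightning==2.1.2
scikit-learn==1.2.*
matplotlib==3.7.*
```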