Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Causally Reliable Concept Bottleneck Models

Authors: Giovanni De Felice, Arianna Casanova Flores, Francesco De Santis, Silvia Santini, Johannes Schneider, Pietro Barbiero, Alberto Termine

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the performance of the proposed C2BM pipeline. Experiments are conducted across different datasets and settings, allowing for the investigation of the following aspects: classification accuracy (Sec. 5.1), causal reliability (Sec. 5.1), accuracy under ground-truth interventions (Sec. 5.2), debiasing (Sec. 5.3), and fairness (Sec. 5.4). App. G provides additional results and ablations.
Researcher Affiliation	Collaboration	Giovanni De Felice (Università della Svizzera Italiana); Arianna Casanova Flores (University of Liechtenstein); Francesco De Santis (Politecnico di Torino); Silvia Santini (Università della Svizzera Italiana); Johannes Schneider (University of Liechtenstein); Pietro Barbiero (IBM Research); Alberto Termine (Scuola Universitaria Professionale della Svizzera Italiana, IDSIA)
Pseudocode	No	The paper describes the C2BM pipeline and its components using text, flowcharts (Figure 2), and mathematical equations. However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Python code for reproducing all experiments is provided alongside the submission as supplementary material.
Open Datasets	Yes	The considered datasets include both synthetic and real-world benchmarks. As synthetic datasets, we sample 10^4 points from each of the five following discrete Bayesian networks available from the bnlearn repository (Scutari, 2010): Asia (Lauritzen & Spiegelhalter, 1988), Sachs (Sachs et al., 2005), Insurance (Binder et al., 1997), Alarm (Beinlich et al., 1989), and Hailfinder (Abramson et al., 1996). We include cMNIST, a variant of the original dataset (LeCun et al., 2010)... Additionally, we consider three real-world datasets: CelebA (Liu et al., 2015)... CUBC, a custom version of the original bird image dataset (He & Peng, 2019)... Siim-Pneumothorax (You et al., 2023)... Datasets licenses are as follows: bnlearn datasets (CC-BY-SA), MNIST (CC BY-SA), CUB (CC0: Public Domain), CelebA (Creative Commons Non Commercial license), Pneumothorax (CC BY 4.0).
Dataset Splits	Yes	For each network, we generate 10000 samples and create training, validation, and test datasets using a 70%/10%/20% split. ...reserve 10% of the training set for validation. ...The dataset is then split into training, validation, and test sets using a traditional 70%/10%/20% partition. ...further split the training set such that 10% is used for validation.
Hardware Specification	Yes	All experiments are conducted on NVIDIA GeForce RTX 3080 and NVIDIA RTX A5000 GPUs.
Software Dependencies	No	The paper mentions several software components and libraries like 'Adam optimizer (Kingma & Ba, 2015)', 'torchvision library (Marcel & Rodriguez, 2010)', 'causal-learn Python library (Zheng et al., 2024)', and 'py-tetrad library (Ramsey & Andrews, 2023)'. However, it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup	Yes	The batch size is set to 512 for most datasets, with the exception of Siim-pneumothorax and SCBM, where it is reduced to 128 due to memory constraints. All models are trained using the Adam optimizer (Kingma & Ba, 2015) for a maximum of 500 epochs, with early stopping based on a 30-epoch patience. We employ LeakyReLU as the activation function throughout. ...L = (1 − α) L_task + α L_concepts, with α = 0.8 ...Additionally, we apply random training-time interventions as proposed by Zarlenga et al. (2022), with an intervention probability of 0.25. ...Key hyperparameters, including learning rate, MLP hidden size, and dropout rate, are selected via grid search.
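
For readers who want to check the arithmetic in the Dataset Splits and Experiment Setup rows, the following is a minimal sketch in plain Python. It is not the authors' code: the function names `split_indices` and `combined_loss`, the shuffling strategy, and the seed are assumptions made for illustration. Only the 70%/10%/20% partition, the 10000-sample size, and the weighting L = (1 − α) L_task + α L_concepts with α = 0.8 come from the quoted text.

```python
import random


def split_indices(n, percents=(70, 10, 20), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    `percents` are integer percentages; the seed is an arbitrary choice
    for this illustration, not a value reported in the paper.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def combined_loss(task_loss, concept_loss, alpha=0.8):
    """Weighted objective L = (1 - alpha) * L_task + alpha * L_concepts."""
    return (1 - alpha) * task_loss + alpha * concept_loss


# 10000 samples per synthetic network, as stated in the Dataset Splits row.
train, val, test = split_indices(10_000)
print(len(train), len(val), len(test))

# Hypothetical loss values, chosen only to show the weighting.
print(combined_loss(task_loss=1.0, concept_loss=0.5))
```

With α = 0.8 the concept term dominates the objective, so a change in the concept loss moves the total four times as much as the same change in the task loss.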