Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Deferring Concept Bottleneck Models: Learning to Defer Interventions to Inaccurate Experts

Authors: Andrea Pugnana, Riccardo Massidda, Francesco Giannini, Pietro Barbiero, Mateo Espinosa Zarlenga, Roberto Pellungrini, Gabriele Dominici, Fosca Giannotti, Davide Bacciu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results show that DCBMs can achieve high predictive performance and interpretability by deferring only when needed. ... We experimentally show how DCBMs react to varying costs and different human-accuracy degrees for defer (Section 4). Moreover, DCBMs can significantly improve concept-incomplete tasks. ... Next, in Section 4, we report an empirical analysis highlighting the advantages of DCBMs.
Researcher Affiliation	Collaboration	Andrea Pugnana (University of Trento); Riccardo Massidda (University of Pisa); Francesco Giannini (Scuola Normale Superiore); Pietro Barbiero (IBM Research); Mateo Espinosa Zarlenga (University of Cambridge, University of Oxford); Roberto Pellungrini (Scuola Normale Superiore); Gabriele Dominici (USI); Fosca Giannotti (Scuola Normale Superiore); Davide Bacciu (University of Pisa)
Pseudocode	No	The paper describes the model formulation and training process using mathematical equations and textual descriptions, but it does not contain a clearly labeled section or figure with structured pseudocode or an algorithm block.
Open Source Code	Yes	We provide the code for reproducing our experiments at https://github.com/andrepugni/DCBM.
Open Datasets	Yes	Datasets. We perform our analysis on two real-world datasets: cifar10-h [Peterson et al., 2019] and CUB [Wah et al., 2011]. ... Finally, we employ the synthetic completeness [Laguna et al., 2024] dataset to study possible variants of our method, whose results we report in Appendix E.
Dataset Splits	Yes	Data Split. For the completeness synthetic dataset, we sample 1,000 instances with an 80%-20% train-test split ratio. For cifar10h, we randomly split the dataset into training, validation and test according to a 70%, 10%, 20% ratio. For CUB, we keep the original split.
Hardware Specification	Yes	Hardware and Computational Time. We train our baselines on a 224 cores machine with Intel(R) Xeon(R) Platinum 8480+ CPU and eight NVIDIA A100-SXM4-80GB, OS Ubuntu 22.04.4 LTS.
Software Dependencies	No	The paper mentions optimizers like "Adam [Kingma and Ba, 2015]" and "AdamW [Loshchilov and Hutter, 2019]" and an operating system "OS Ubuntu 22.04.4 LTS", but does not list specific versions for key software libraries or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA versions) used for implementation beyond the OS.
Experiment Setup	Yes	Training Procedure. We train every combination of models and defer costs λ for 100 epochs. For completeness, we use Adam [Kingma and Ba, 2015] with a learning rate equal to .001 and no scheduler. For both cifar10-h and CUB, we use AdamW [Loshchilov and Hutter, 2019] as an optimizer, setting the initial learning rate to .001. We decrease the learning rate every 25 epochs by .5. Additionally, for CUB, following Zarlenga et al. [2022] guidelines, we consider a weighted version of the loss on concepts to take into account their imbalance. To limit the computational burden, for both cifar10-h and CUB, we perform early stopping after 10 epochs if there is no improvement for the loss on the validation set.
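
The Dataset Splits and Experiment Setup rows above describe a random 70%/10%/20% split, a learning rate that starts at .001 and is decreased by .5 every 25 epochs, and early stopping after 10 epochs without validation improvement. The following is a minimal, dependency-free sketch of one way to read those settings. It is not the authors' code. The function names are hypothetical, and two interpretations are assumptions: ".5" is treated as a multiplicative decay factor, and "10 epochs" is treated as a patience counter on the best validation loss seen so far.

```python
import random


def step_decay_lr(epoch, base_lr=1e-3, step=25, factor=0.5):
    """Learning rate at a 0-indexed epoch under step decay.

    Assumption: "decrease by .5 every 25 epochs" means multiplying by 0.5.
    """
    return base_lr * factor ** (epoch // step)


def split_indices(n, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for a repeatable split
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    # Whatever remains after train and validation goes to the test split.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-indexed epoch at which training would stop.

    Assumption: stop once `patience` consecutive epochs fail to improve
    on the best validation loss; otherwise run through the last epoch.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1
```

Under these assumptions, a 100-epoch run uses four learning-rate plateaus (epochs 0-24, 25-49, 50-74, 75-99), each half the previous one. The paper does not report the random seed or the exact stopping rule in the text quoted above, so the repository linked in the Open Source Code row remains the reference for the actual behavior.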