Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

Authors: Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical results highlight the impact of RSs and show that existing methods, even combined with multiple natural mitigation strategies, often fail to meet these conditions in practice. ... 5 Case Studies: We tackle the following key research questions: Q1. Are CBMs affected by JRSs in practice? Q2. Do JRSs affect interpretability and OOD behavior? Q3. Can existing mitigation strategies prevent JRSs? Appendix A reports additional details about the tasks, architectures, and model selection. ... We evaluate several (also NeSy) CBMs. ... We use three representative NeSy tasks with explicit concept annotations and prior knowledge. MNIST-Add [41] ... MNIST-Sum Parity ... Clevr [83] ... BDD-OIA [84]. ... For each CBM, we evaluate predicted labels and concepts with the F1 score (resp. F1(Y) and F1(C)) on the test split.
Researcher Affiliation	Academia	Samuele Bortolotti DISI, University of Trento Italy EMAIL Emanuele Marconato DISI, University of Trento Italy EMAIL Paolo Morettin DISI, University of Trento Italy EMAIL Andrea Passerini DISI, University of Trento Italy EMAIL Stefano Teso CIMeC and DISI, University of Trento Italy EMAIL
Pseudocode	No	The paper describes methodologies and theoretical frameworks but does not include any clearly labeled pseudocode or algorithm blocks. The methods are explained in descriptive text.
Open Source Code	Yes	We will release an open-source implementation of our code upon paper acceptance. Reviewers can inspect the code in the supplementary materials. ... The complete codebase is publicly available at https://github.com/samuelebortolotti/joint-reasoning-shortcuts.
Open Datasets	Yes	All data sets were generated using the rsbench library [22]. ... MNIST-Add [41] consists of pairs of MNIST digits [45] ... Clevr [83] consists of 3D scenes ... BDD-OIA [84] is an autonomous driving dataset.
Dataset Splits	Yes	Overall, MNIST-Add has 42,000 training examples, 12,000 validation examples, and 6,000 test examples. ... Overall, Clevr has 6000 training examples, 1200 validation examples, and 1800 test examples. ... The training set contains almost 16k fully labeled frames, while the validation and test sets include almost 2k and 4.5k annotated samples, respectively.
Hardware Specification	Yes	All the experiments were implemented using Python 3.9 and Pytorch 1.13 and run on one A100 GPU.
Software Dependencies	Yes	All the experiments were implemented using Python 3.9 and Pytorch 1.13 and run on one A100 GPU.
Experiment Setup	Yes	Hyperparameter search. We performed an extensive grid search on the validation set over the following hyperparameters: (i) Learning rate (γ) in {1e-4, 1e-3, 1e-2}; (ii) Weight decay (ω) in {0, 1e-4, 1e-3, 1e-2, 1e-1}; (iii) Reconstruction, contrastive loss, entropy loss, concept supervision loss and knowledge supervision loss weights (w_r, w_c, w_h, w_csup and w_k, respectively) in {0.1, 1, 2, 5, 8, 10}; (iv) Batch size (ν) in {32, 64, 128, 256, 512}. ... All experiments were run for approximately 50 epochs for MNIST variants, 30 for BDD-OIA and 100 epochs for Clevr using early stopping based on validation set F1(Y) performance. ... we used the Adam optimizer [105], while for DSL and DPL we achieved better results using the Madgrad optimizer [106].
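
The hyperparameter grid quoted in the Experiment Setup row can be enumerated as in the sketch below. This is a minimal Python illustration, not the authors' code: the key names (`lr`, `weight_decay`, `batch_size`, `w_r`, `w_c`, `w_h`, `w_csup`, `w_k`) and the helper functions are hypothetical. The row does not say which loss weights apply to which model, so that choice is left to the caller as an assumption.

```python
from itertools import product

# Values copied from the Experiment Setup row; key names are hypothetical.
BASE_GRID = {
    "lr": [1e-4, 1e-3, 1e-2],
    "weight_decay": [0, 1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [32, 64, 128, 256, 512],
}
LOSS_WEIGHT_VALUES = [0.1, 1, 2, 5, 8, 10]
LOSS_WEIGHT_NAMES = ["w_r", "w_c", "w_h", "w_csup", "w_k"]


def build_grid(active_loss_weights=()):
    """Return the search grid, adding only the loss weights a model uses.

    Assumption: each model tunes a subset of the five loss weights.
    """
    grid = dict(BASE_GRID)
    for name in active_loss_weights:
        if name not in LOSS_WEIGHT_NAMES:
            raise ValueError(f"unknown loss weight: {name}")
        grid[name] = LOSS_WEIGHT_VALUES
    return grid


def iter_configs(active_loss_weights=()):
    """Yield one dict per point of the Cartesian product of the grid."""
    grid = build_grid(active_loss_weights)
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_size(active_loss_weights=()):
    """Number of configurations without materialising them."""
    size = 1
    for values in build_grid(active_loss_weights).values():
        size *= len(values)
    return size


if __name__ == "__main__":
    print(grid_size())                    # base grid only
    print(grid_size(["w_csup"]))          # one active loss weight
    print(grid_size(LOSS_WEIGHT_NAMES))   # all five loss weights
```

Under these assumptions the base grid has 3 × 5 × 5 = 75 configurations, and each active loss weight multiplies that count by 6. A joint search over all five weights would therefore cover 583,200 points.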