Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration

Authors: Thomas Decker, Volker Tresp, Florian Buettner

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features. We conducted a comprehensive empirical evaluation of ReCalX considering neural classifiers on different tabular datasets [23, 22] and various computer vision models on the ImageNet ILSVRC2012 dataset. Detailed descriptions of all experiments and further computational details are provided in Appendix B, while additional results supporting each finding are presented in Appendix C. Accompanying source code is available at https://github.com/thomdeck/recalx.
Researcher Affiliation	Collaboration	Thomas Decker (1,2,3), Volker Tresp (2,3), Florian Buettner (4,5,6). (1) Siemens AG, (2) LMU Munich, (3) Munich Center for Machine Learning (MCML), (4) Goethe University Frankfurt, (5) German Cancer Research Center (DKFZ), (6) German Cancer Consortium (DKTK). EMAIL, EMAIL, EMAIL
Pseudocode	No	A more detailed description of the algorithm is presented in Appendix B.
Open Source Code	Yes	Accompanying source code is available at https://github.com/thomdeck/recalx.
Open Datasets	Yes	We conducted a comprehensive empirical evaluation of ReCalX considering neural classifiers on different tabular datasets [23, 22] and various computer vision models on the ImageNet ILSVRC2012 dataset.
Dataset Splits	Yes	To implement ReCalX, we selected 200 random validation samples from each considered dataset and used 10 perturbed instances per considered perturbation level. This results in a calibration set of 2000 samples per bin. We explicitly evaluated the KL-divergence-based calibration error using the consistent and asymptotically unbiased estimator proposed by [51] to be fully aligned with our theoretical analysis above. Each error is derived based on at least 5000 unseen samples from each dataset.
Hardware Specification	No	We provide experimental details including information on the compute resources in Appendix B.
Software Dependencies	No	Detailed descriptions of all experiments and further computational details are provided in Appendix B
Experiment Setup	Yes	To implement ReCalX, we selected 200 random validation samples from each considered dataset and used 10 perturbed instances per considered perturbation level. This results in a calibration set of 2000 samples per bin. The temperature parameter $T$ is optimized on a held-out validation set, typically by minimizing the cross-entropy loss $L_{\mathrm{CE}}$ in the case of classification. To address the second requirement, we propose to go beyond classical temperature scaling and introduce ReCalX as a generalized version. It aims to reduce the calibration error under all perturbations faced during the explanation process by scaling logits using an adaptive temperature that depends on the perturbation level implied by $S$. Formally, for a subset $S \subseteq \{1, \dots, d\}$, we define the perturbation level $\lambda(S)$ as the fraction of perturbed features: $\lambda(S) = (d - |S|)/d \in [0, 1]$. To account for different perturbation intensities, we partition $[0, 1]$ into $B$ equal-width bins and learn a specific temperature for each bin. Given a validation set $D_{\mathrm{val}} = \{(x_i, y_i)\}_{i=1}^{N}$, we optimize a temperature $T_b$ for each bin by minimizing the cross-entropy loss $L_{\mathrm{CE}}$ on perturbed samples with corresponding perturbation levels.
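
The per-bin temperature scheme quoted in the Experiment Setup row can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that is in the linked repository). The function names, the reading of S as the set of retained features, and the grid search over candidate temperatures are all assumptions made for illustration; the quoted text only states that each bin's temperature minimizes the cross-entropy loss. With 200 validation samples and 10 perturbed instances per level, each bin would receive 200 × 10 = 2000 samples.

```python
import numpy as np


def perturbation_level(num_kept: int, d: int) -> float:
    """lambda(S) = (d - |S|) / d, with |S| = num_kept (assumed: S holds the unperturbed features)."""
    return (d - num_kept) / d


def bin_index(level: float, num_bins: int) -> int:
    """Map a level in [0, 1] to one of num_bins equal-width bins (level 1.0 goes to the last bin)."""
    return min(int(level * num_bins), num_bins - 1)


def cross_entropy(logits: np.ndarray, labels: np.ndarray, temperature: float) -> float:
    """Mean cross-entropy of softmax(logits / temperature) against integer labels."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def fit_bin_temperatures(logits, labels, levels, num_bins, grid=None):
    """Pick one temperature per bin by grid search (an illustrative stand-in optimizer).

    logits: (n, classes) model outputs on perturbed validation samples
    labels: (n,) integer class labels
    levels: (n,) perturbation level of each sample, in [0, 1]
    Bins that receive no samples keep the neutral temperature 1.0.
    """
    if grid is None:
        grid = np.logspace(-1, 1, 201)  # hypothetical candidate range 0.1 .. 10
    bins = np.array([bin_index(float(lv), num_bins) for lv in levels])
    temperatures = np.ones(num_bins)
    for b in range(num_bins):
        mask = bins == b
        if not mask.any():
            continue
        losses = [cross_entropy(logits[mask], labels[mask], t) for t in grid]
        temperatures[b] = grid[int(np.argmin(losses))]
    return temperatures
```

At explanation time, a perturbed input with level λ(S) would have its logits divided by the temperature of the bin that λ(S) falls into before the softmax is applied. A gradient-based optimizer could replace the grid search without changing the overall structure.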