Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Integral Imprecise Probability Metrics

Authors: Siu Lun (Alan) Chau, Michele Caprio, Krikamol Muandet

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Section 6 Empirical validation of MMI with selective prediction experiments. This section demonstrates the practicality of the proposed MMI measure. Additional ablation studies and detailed experiment descriptions are provided in Appendix C. The code to reproduce our experiments is here [124]. We evaluate MMI by its ability to capture informative EU in K-class classification...Figure 1: Accuracy-Rejection (AR) curves on four classification tasks. The area under the curve (AUC) is reported for numerical comparison. We consistently outperform entropy difference (E-Diff) and match the performance of Generalised Hartley (GH). On large-scale problems, our efficient upper bound (MMI-Lin) remains tractable and continues to outperform E-Diff.
Researcher Affiliation	Academia	Siu Lun Chau Nanyang Technological University Singapore Michele Caprio University of Manchester United Kingdom Krikamol Muandet RI Lab, CISPA Saarbrücken, Germany
Pseudocode	No	The paper includes definitions, theorems, lemmas, and propositions but does not contain any clearly labeled pseudocode or algorithm blocks. Procedural steps are described in narrative text.
Open Source Code	Yes	The code to reproduce our experiments is here [124].
Open Datasets	Yes	We evaluate on two UCI datasets [128, 129] and CIFAR-10/100 [130].
Dataset Splits	Yes	Each dataset is split into training and test sets...For tabular datasets (the Obesity dataset from UCI and Digits dataset from Sci-kit learn), we train 10 random forests with randomly chosen hyperparameters (e.g., tree depth) on the same training set, and evaluate them on the test set. For image datasets (CIFAR10 and CIFAR100), we use 10 pretrained neural networks per task...we evaluate the standard CIFAR test sets by dividing them into 10 buckets to introduce variability.
Hardware Specification	Yes	The experiments were executed on a machine with 8 vCPUs, 30 GB memory, with a NVIDIA V100 GPU.
Software Dependencies	No	The paper mentions 'PyTorch's default CIFAR training sets' and 'random forests' but does not provide specific version numbers for any software libraries or frameworks used in the experiments.
Experiment Setup	Yes	For tabular datasets..., we train 10 random forests with randomly chosen hyperparameters (e.g., tree depth) on the same training set... For image datasets..., we use 10 pretrained neural networks per task... For both tabular and image data, we use the centroid of the credal set as the predictor, similar to standard ensemble methods. The corresponding lower probability is computed by evaluating the most pessimistic likelihood across the set of predictions for each possible outcome.
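
The setup quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' code, and it rests on assumptions not stated in the source: the credal set is represented by the predicted class probabilities of a finite ensemble, the centroid is taken as the plain average over members, and the lower probability of each single class is the minimum probability any member assigns to it. The function names `credal_summary` and `accuracy_rejection_curve` are hypothetical. The uncertainty score passed to the curve is left to the caller, since computing MMI itself is not described on this page.

```python
import numpy as np


def credal_summary(member_probs):
    """Summarise an ensemble-based credal set (illustrative assumption).

    member_probs has shape (n_members, n_samples, n_classes); each row along
    the last axis is one member's predictive distribution for one sample.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    # Centroid: average prediction over members, as in standard ensembling.
    centroid = member_probs.mean(axis=0)
    # Lower probability of each single class: the most pessimistic
    # (smallest) probability any member assigns to that class.
    lower = member_probs.min(axis=0)
    return centroid, lower


def accuracy_rejection_curve(y_true, y_pred, uncertainty, n_points=10):
    """Accuracy on retained samples as the most uncertain ones are rejected."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    uncertainty = np.asarray(uncertainty, dtype=float)

    # Sort so that the most certain samples come first.
    order = np.argsort(uncertainty, kind="stable")
    correct = (y_pred == y_true)[order]

    # Rejection rates in [0, 1); rejecting everything is excluded because
    # accuracy is undefined on an empty set.
    rejection_rates = np.linspace(0.0, 1.0, n_points, endpoint=False)
    accuracies = []
    for rate in rejection_rates:
        n_reject = int(np.floor(rate * len(correct)))
        n_keep = max(1, len(correct) - n_reject)
        accuracies.append(correct[:n_keep].mean())
    return rejection_rates, np.array(accuracies)


if __name__ == "__main__":
    # Two ensemble members, two samples, two classes (made-up numbers).
    members = np.array([
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.5, 0.5], [0.2, 0.8]],
    ])
    centroid, lower = credal_summary(members)
    print("centroid:\n", centroid)
    print("lower probabilities:\n", lower)

    # Made-up labels, predictions, and uncertainty scores.
    rates, accs = accuracy_rejection_curve(
        y_true=[0, 1, 1, 0],
        y_pred=[0, 1, 0, 0],
        uncertainty=[0.1, 0.2, 0.9, 0.3],
        n_points=4,
    )
    print("rejection rates:", rates)
    print("accuracies:", accs)
```

An informative uncertainty score should make the curve rise as the rejection rate grows, which is the behaviour the area under the AR curve in Figure 1 is meant to capture.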