Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Integral Imprecise Probability Metrics
Authors: Siu Lun (Alan) Chau, Michele Caprio, Krikamol Muandet
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 Empirical validation of MMI with selective prediction experiments. This section demonstrates the practicality of the proposed MMI measure. Additional ablation studies and detailed experiment descriptions are provided in Appendix C. The code to reproduce our experiments is here [124]. We evaluate MMI by its ability to capture informative EU in K-class classification...Figure 1: Accuracy-Rejection (AR) curves on four classification tasks. The area under the curve (AUC) is reported for numerical comparison. We consistently outperform entropy difference (E-Diff) and match the performance of Generalised Hartley (GH). On large-scale problems, our efficient upper bound (MMI-Lin) remains tractable and continues to outperform E-Diff. |
| Researcher Affiliation | Academia | Siu Lun Chau Nanyang Technological University Singapore Michele Caprio University of Manchester United Kingdom Krikamol Muandet RI Lab, CISPA Saarbrücken, Germany |
| Pseudocode | No | The paper includes definitions, theorems, lemmas, and propositions but does not contain any clearly labeled pseudocode or algorithm blocks. Procedural steps are described in narrative text. |
| Open Source Code | Yes | The code to reproduce our experiments is here [124]. |
| Open Datasets | Yes | We evaluate on two UCI datasets [128, 129] and CIFAR-10/100 [130]. |
| Dataset Splits | Yes | Each dataset is split into training and test sets...For tabular datasets (the Obesity dataset from UCI and Digits dataset from scikit-learn), we train 10 random forests with randomly chosen hyperparameters (e.g., tree depth) on the same training set, and evaluate them on the test set. For image datasets (CIFAR10 and CIFAR100), we use 10 pretrained neural networks per task...we evaluate the standard CIFAR test sets by dividing them into 10 buckets to introduce variability. |
| Hardware Specification | Yes | The experiments were executed on a machine with 8 vCPUs, 30 GB memory, with an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch's default CIFAR training sets' and 'random forests' but does not provide specific version numbers for any software libraries or frameworks used in the experiments. |
| Experiment Setup | Yes | For tabular datasets..., we train 10 random forests with randomly chosen hyperparameters (e.g., tree depth) on the same training set... For image datasets..., we use 10 pretrained neural networks per task... For both tabular and image data, we use the centroid of the credal set as the predictor, similar to standard ensemble methods. The corresponding lower probability is computed by evaluating the most pessimistic likelihood across the set of predictions for each possible outcome. |
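
The Experiment Setup row describes two quantities derived from an ensemble's predictions: the centroid of the credal set, used as the predictor, and the lower probability of each outcome. The following is a minimal sketch of how these might be computed, not the authors' implementation. It assumes the credal set is generated by the ensemble members' predicted class distributions, that the centroid is their element-wise mean, and that the lower probability of a single class is the minimum probability any member assigns to it. The function name `centroid_and_lower` and the example numbers are hypothetical.

```python
def centroid_and_lower(member_probs):
    """Summarise an ensemble's predictions for one input.

    member_probs: list of probability vectors, one per ensemble member,
    each of length n_classes (assumed to sum to 1).
    Returns (centroid, lower), both lists of length n_classes.
    """
    n_members = len(member_probs)
    # Transpose so each tuple holds all members' probabilities for one class.
    per_class = list(zip(*member_probs))
    # Assumption: centroid = element-wise mean of the member predictions.
    centroid = [sum(col) / n_members for col in per_class]
    # Assumption: lower probability of a class = most pessimistic member value.
    lower = [min(col) for col in per_class]
    return centroid, lower


# Hypothetical predictions from 3 ensemble members on a 3-class problem.
member_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]

centroid, lower = centroid_and_lower(member_probs)
# Predict the class with the highest centroid probability.
predicted_class = max(range(len(centroid)), key=centroid.__getitem__)
```

Unlike the centroid, the lower probabilities need not sum to one; the gap between the two reflects how much the ensemble members disagree.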