Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conformalized Credal Regions for Classification with Ambiguous Ground Truth
Authors: Michele Caprio, David Stutz, Shuo Li, Arnaud Doucet
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify our findings on both synthetic and real datasets. We verify the proposed algorithm on three datasets with ambiguous ground truth, including the toy and Dermatology DDx datasets from Stutz et al. (2023b;c) (Toy and Derm, respectively), and CIFAR-10H (Peterson et al., 2019b) (Cifar10h). |
| Researcher Affiliation | Collaboration | Michele Caprio, Department of Computer Science, University of Manchester; David Stutz, DeepMind; Shuo Li, Department of Computer and Information Science, University of Pennsylvania; Arnaud Doucet, DeepMind and Department of Statistics, University of Oxford |
| Pseudocode | Yes | Algorithm 1: Computing Imprecise Highest Density Set IS_{P,δ} |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We verify the proposed algorithm on three datasets with ambiguous ground truth, including the toy and Dermatology DDx datasets from Stutz et al. (2023b;c) (Toy and Derm, respectively), and CIFAR-10H (Peterson et al., 2019b) (Cifar10h). For Derm, whose details are discussed in Appendix C, we use risk labels as classes, classifying cases into low, medium and high risk. This dataset is from Liu et al. (2020). |
| Dataset Splits | Yes | We split each dataset into random calibration and testing sets in a 50%-50% ratio. |
| Hardware Specification | No | The paper mentions training models like an MLP and a ResNet50, but does not provide specific hardware details such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions machine learning models (e.g., single-layer MLP, ResNet50) and techniques (e.g., conformal prediction) but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | For the Toy dataset, we train a single-layer MLP with 100 hidden neurons, achieving an accuracy of 77.5%. For CIFAR-10H, we employ a ResNet50 (He et al., 2015) model trained on the original CIFAR-10 training set, obtaining an accuracy of 93.6%. We consider different miscoverage levels ϵ, including 0.05, 0.1, 0.15, 0.2, 0.25 and 0.3. We denote the coverage level as 1 − ϵ in our plots. For simplicity, we set the significance levels α and δ to ϵ/2. |
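The split and calibration protocol described above (a random 50%-50% calibration/testing split, a sweep over miscoverage levels ϵ, and α = δ = ϵ/2) can be sketched as follows. This is a hedged illustration, not the authors' code: the data, the nonconformity score (one minus the softmax probability of the true class), and the standard split-conformal quantile are stand-ins chosen for a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n examples with softmax scores over k classes.
n, k = 1000, 10
scores = rng.dirichlet(np.ones(k), size=n)
labels = rng.integers(0, k, size=n)

# Random 50%-50% calibration/testing split, as described in the paper.
perm = rng.permutation(n)
cal_idx, test_idx = perm[: n // 2], perm[n // 2:]

# Miscoverage levels considered in the experiments; the two significance
# levels alpha and delta are each set to eps / 2.
for eps in (0.05, 0.1, 0.15, 0.2, 0.25, 0.3):
    alpha = delta = eps / 2
    # Standard split-conformal quantile of calibration nonconformity
    # scores (illustrative score: 1 - softmax of the true class).
    nonconf = 1.0 - scores[cal_idx, labels[cal_idx]]
    level = np.ceil((len(cal_idx) + 1) * (1 - alpha)) / len(cal_idx)
    q = np.quantile(nonconf, level)
    print(f"eps={eps:.2f}  alpha=delta={alpha:.3f}  quantile={q:.3f}")
```

The quantile `q` would then define a prediction set per test point (all classes whose nonconformity score falls below `q`); the paper's contribution layers an imprecise/credal construction on top of this basic split-conformal recipe.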