Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conformalized Credal Regions for Classification with Ambiguous Ground Truth
Authors: Michele Caprio, David Stutz, Shuo Li, Arnaud Doucet
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify our findings on both synthetic and real datasets. We verify the proposed algorithm on three datasets with ambiguous ground truth, including the toy and Dermatology DDx datasets from Stutz et al. (2023b;c) (Toy and Derm, respectively), and CIFAR-10H (Peterson et al., 2019b) (Cifar10h). |
| Researcher Affiliation | Collaboration | Michele Caprio, Department of Computer Science, University of Manchester; David Stutz, DeepMind; Shuo Li, Department of Computer and Information Science, University of Pennsylvania; Arnaud Doucet, DeepMind and Department of Statistics, University of Oxford |
| Pseudocode | Yes | Algorithm 1: Computing Imprecise Highest Density Set IS_{P,δ} |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We verify the proposed algorithm on three datasets with ambiguous ground truth, including the toy and Dermatology DDx datasets from Stutz et al. (2023b;c) (Toy and Derm, respectively), and CIFAR-10H (Peterson et al., 2019b) (Cifar10h). For Derm, whose details are discussed in Appendix C, we use risk labels as classes, classifying cases into low, medium and high risk. This dataset is from Liu et al. (2020). |
| Dataset Splits | Yes | We split each dataset into random calibration and testing sets in a 50%-50% ratio. |
| Hardware Specification | No | The paper mentions training models like an MLP and a ResNet50, but does not provide specific hardware details such as GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions machine learning models (e.g., single-layer MLP, ResNet50) and techniques (e.g., conformal prediction) but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | For the Toy dataset, we train a single-layer MLP with 100 hidden neurons, achieving an accuracy of 77.5%. For CIFAR-10H, we employ a ResNet50 (He et al., 2015) model trained on the original CIFAR-10 training set, obtaining an accuracy of 93.6%. We consider different miscoverage levels ϵ, including 0.05, 0.1, 0.15, 0.2, 0.25 and 0.3. We denote the coverage level as 1 − ϵ in our plots. For simplicity, we set the significance levels α and δ to ϵ/2. |
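The split and calibration protocol described above (a random 50%-50% calibration/testing split, a sweep over miscoverage levels ϵ, and α = δ = ϵ/2) can be sketched as follows. This is a hedged illustration, not the authors' code: the data, the nonconformity score (one minus the softmax probability of the true class), and the standard split-conformal quantile are stand-ins chosen for a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n examples with softmax scores over k classes.
n, k = 1000, 10
scores = rng.dirichlet(np.ones(k), size=n)
labels = rng.integers(0, k, size=n)

# Random 50%-50% calibration/testing split, as described in the paper.
perm = rng.permutation(n)
cal_idx, test_idx = perm[: n // 2], perm[n // 2:]

# Miscoverage levels considered in the experiments; the two significance
# levels alpha and delta are each set to eps / 2.
for eps in (0.05, 0.1, 0.15, 0.2, 0.25, 0.3):
    alpha = delta = eps / 2
    # Standard split-conformal quantile of calibration nonconformity
    # scores (illustrative score: 1 - softmax of the true class).
    nonconf = 1.0 - scores[cal_idx, labels[cal_idx]]
    level = np.ceil((len(cal_idx) + 1) * (1 - alpha)) / len(cal_idx)
    q = np.quantile(nonconf, level)
    print(f"eps={eps:.2f}  alpha=delta={alpha:.3f}  quantile={q:.3f}")
```

The quantile `q` would then define a prediction set per test point (all classes whose nonconformity score falls below `q`); the paper's contribution layers an imprecise/credal construction on top of this basic split-conformal recipe.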