Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Credal Prediction based on Relative Likelihood

Authors: Timo Löhr, Paul Hofman, Felix Mohr, Eyke Hüllermeier

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To validate our approach, we illustrate its effectiveness by experiments on benchmark datasets demonstrating superior uncertainty representation without compromising predictive performance. We also compare our method against several state-of-the-art baselines in credal prediction.
Researcher Affiliation	Academia	Timo Löhr (LMU Munich, MCML); Paul Hofman (LMU Munich, MCML); Felix Mohr (Universidad de La Sabana); Eyke Hüllermeier (LMU Munich, MCML, DFKI)
Pseudocode	Yes	Algorithm 1 Train Credal Relative Likelihood ensemble.
Open Source Code	Yes	The code for all implementations and experiments is published in a GitHub repository: https://github.com/timoverse/credal-prediction-relative-likelihood
Open Datasets	Yes	ChaosNLI is publicly available under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Dataset Splits	Yes	The CIFAR-10 dataset... is partitioned into 50,000 training images and 10,000 test images, organized into five training batches and one test batch, each containing 10,000 images.
Hardware Specification	Yes	CPU: AMD EPYC MILAN 7413 Processor, 24C/48T, 2.65GHz, 128MB L3 Cache; GPU: 2x NVIDIA A40 (48 GB GDDR each); RAM: 128 GB (4x 32GB) DDR4-3200MHz ECC DIMM; Storage: 2x 480GB Samsung Datacenter SSD PM893
Software Dependencies	No	The paper mentions using PyTorch, TensorFlow Datasets, and SciPy, as well as optimizers like Adam and SGD. However, it does not provide specific version numbers for these software components. For example: "For experiments on the CIFAR-10 dataset, we use the PyTorch ResNet-18 implementation and hyperparameters provided by https://github.com/kuangliu/pytorch-cifar."
Experiment Setup	Yes	Each dataset is trained using a dedicated set of hyperparameters as presented in Table 2. We evaluated multiple configurations and selected the best-performing ones for each dataset. To ensure fair and consistent comparisons, all models trained on a given dataset, both our approach and the baselines, use the same hyperparameter settings. The only exception is the CreBNN, which requires a KL-divergence penalty of 1e-7 and zero weight decay when using the Adam optimizer [Kingma and Ba, 2015]. When we apply the SGD optimizer with a learning rate scheduler, namely Cosine Annealing [Loshchilov and Hutter, 2017], CreBNN additionally requires a momentum of 0.9 to enable effective learning. Table 2: Hyperparameters used for each dataset, listed as ChaosNLI / CIFAR-10 / QualityMRI. Model: FCNet / ResNet18 / ResNet18. Epochs: 300 / 200 / 200. Learning rate: 0.01 / 0.1 / 0.01. Weight decay: 0.0 / 0.0005 / 0.0005. Optimizer: Adam / SGD / SGD. Ensemble members: 20 / 20 / 20. LR scheduler: - / Cosine Annealing / Cosine Annealing. Tobias value: 100 / 100 / 100.
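
The hyperparameters quoted in the Experiment Setup row can be written down as a small configuration, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the names `TrainConfig`, `CONFIGS`, `cosine_annealing_lr`, and `lr_for_epoch` are hypothetical, and three points are assumptions rather than statements from the paper: that the Adam-trained dataset uses no scheduler (the quoted table lists only two scheduler entries), that the schedule is stepped once per epoch over the full epoch budget, and that the minimum learning rate is 0. The schedule itself is the standard closed-form cosine annealing without restarts.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """One column of Table 2 (field names are illustrative, not from the paper's code)."""
    model: str
    epochs: int
    learning_rate: float
    weight_decay: float
    optimizer: str
    ensemble_members: int
    lr_scheduler: Optional[str]


# Values transcribed from the quoted Table 2.
# Assumption: the Adam-trained dataset has no scheduler (lr_scheduler=None).
CONFIGS = {
    "ChaosNLI": TrainConfig("FCNet", 300, 0.01, 0.0, "Adam", 20, None),
    "CIFAR-10": TrainConfig("ResNet18", 200, 0.1, 0.0005, "SGD", 20, "CosineAnnealing"),
    "QualityMRI": TrainConfig("ResNet18", 200, 0.01, 0.0005, "SGD", 20, "CosineAnnealing"),
}


def cosine_annealing_lr(base_lr: float, epoch: int, total_epochs: int,
                        min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing without restarts.

    Assumption: min_lr defaults to 0 and the period equals total_epochs.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


def lr_for_epoch(dataset: str, epoch: int) -> float:
    """Learning rate at a given epoch under the assumptions stated above."""
    cfg = CONFIGS[dataset]
    if cfg.lr_scheduler is None:
        return cfg.learning_rate  # constant learning rate
    return cosine_annealing_lr(cfg.learning_rate, epoch, cfg.epochs)


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, [round(lr_for_epoch(name, e), 5) for e in (0, 50, 100)])
```

Under these assumptions the CIFAR-10 learning rate starts at 0.1, passes through half that value at the midpoint of training, and decays to 0 at the final epoch, while the ChaosNLI rate stays constant. The CreBNN-specific settings (KL-divergence penalty, zero weight decay with Adam, momentum 0.9 with SGD) are not modeled here.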