Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Uncertainty Estimation by Flexible Evidential Deep Learning

Authors: Taeseong Yoon, Heeyoung Kim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically evaluate F-EDL across a wide range of UQ-related downstream tasks, including classification, misclassification detection, OOD detection, and distribution shift detection. Notably, F-EDL consistently achieves state-of-the-art performance across diverse settings, including classical, long-tailed, and noisy ID scenarios, highlighting its robustness and generalizability. In addition, qualitative analyses show that F-EDL captures interpretable multimodal uncertainty reflecting ambiguity across plausible classes, while demonstrating faithful epistemic behavior that decreases with more data.
Researcher Affiliation	Academia	Taeseong Yoon Heeyoung Kim Department of Industrial and Systems Engineering, KAIST EMAIL
Pseudocode	Yes	Algorithm 1: F-EDL Training; Algorithm 2: F-EDL Prediction and Uncertainty Quantification
Open Source Code	Yes	The code for our model is publicly available at https://github.com/TaeseongYoon/F-EDL.
Open Datasets	Yes	In the classical setting, CIFAR-10 and CIFAR-100 [45] were used as the primary ID datasets. For the long-tailed setting, we used CIFAR-10-LT [46], an artificially imbalanced version of CIFAR-10, as the ID dataset. For the noisy setting, DMNIST, a variant of MNIST containing ambiguous data points, was used as the ID dataset. For OOD detection, SVHN [47] and CIFAR-100 served as OOD datasets when CIFAR-10 or CIFAR-10-LT was used as ID, while SVHN and Tiny ImageNet (TIN) were used as the OOD datasets for CIFAR-100. FMNIST was used as the OOD dataset for DMNIST. For distribution shift detection, we utilized CIFAR-10-C [48], a dataset created by applying continuous distribution shifts to CIFAR-10, as the OOD dataset.
Dataset Splits	Yes	The dataset contains 50,000 training images and 10,000 testing images... For our experiments, the training set is further divided into training and validation subsets with a ratio of 0.95 : 0.05. ... DMNIST contains 120,000 training images (60,000 from MNIST and 60,000 from AMNIST) and 70,000 testing images (10,000 from MNIST and 60,000 from AMNIST).
Hardware Specification	Yes	Depending on availability, we used either an RTX 4060 GPU with 8GB memory or a TITAN V GPU with 12GB memory.
Software Dependencies	No	All experiments were implemented in PyTorch.
Experiment Setup	Yes	The detailed experimental setups and hyperparameter configurations are provided in Table 10. All experiments were implemented in PyTorch. Depending on availability, we used either an RTX 4060 GPU with 8GB memory or a TITAN V GPU with 12GB memory.
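
The Dataset Splits row reports that the 50,000-image training set is further divided into training and validation subsets with a ratio of 0.95 : 0.05. A minimal sketch of what such a split could look like is given below. The quoted text does not say how the split is drawn, so the random shuffle, the seed, and the helper name `split_indices` are assumptions made for illustration, not the authors' procedure.

```python
import random


def split_indices(n_total, train_fraction=0.95, seed=0):
    """Shuffle indices 0..n_total-1 and split them into train/validation lists.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not details taken from the paper.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    # Number of examples kept for training (47,500 of 50,000 at 0.95).
    n_train = round(n_total * train_fraction)
    return indices[:n_train], indices[n_train:]


# 50,000 is the training-set size quoted in the Dataset Splits row.
train_idx, val_idx = split_indices(50_000)
print(len(train_idx), len(val_idx))
```

Under these assumptions the split yields 47,500 training and 2,500 validation images; the 10,000 test images are left untouched.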