Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable?

Authors: Anna-Kathrin Kopetzki, Bertrand Charpentier, Daniel Zügner, Sandhya Giri, Stephan Günnemann

ICML 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results suggest that uncertainty estimates of DBU models are not robust w.r.t. three important tasks: (1) indicating correctly and wrongly classified samples; (2) detecting adversarial examples; and (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. Additionally, we explore the first approaches to make DBU models more robust. While adversarial training has a minor effect, our median smoothing based approach significantly increases robustness of DBU models. Experiments are performed on two image data sets (MNIST (Le Cun & Cortes, 2010) and CIFAR10 (Krizhevsky et al., 2009)), which contain bounded inputs and two tabular data sets (Segment (Dua & Graff, 2017) and Sensorless drive (Dua & Graff, 2017)), consisting of unbounded inputs.
Researcher Affiliation	Academia	Technical University of Munich, Germany; Department of Informatics.
Pseudocode	No	No pseudocode or algorithm blocks were found in the paper.
Open Source Code	Yes	The code and further supplementary material is available online (www.daml.in.tum.de/dbu-robustness).
Open Datasets	Yes	Experiments are performed on two image data sets (MNIST (Le Cun & Cortes, 2010) and CIFAR10 (Krizhevsky et al., 2009)), which contain bounded inputs and two tabular data sets (Segment (Dua & Graff, 2017) and Sensorless drive (Dua & Graff, 2017)), consisting of unbounded inputs. As Prior Net requires OOD training data, we use two further image data sets (Fashion MNIST (Xiao et al., 2017) and CIFAR100 (Krizhevsky et al., 2009)) for training on MNIST and CIFAR10, respectively.
Dataset Splits	No	The paper states, 'Further details on the experimental setup are provided in the appendix (see Section 6.2).', but does not explicitly provide training/validation/test splits with percentages or sample counts in the main text.
Hardware Specification	No	No specific hardware details (e.g., GPU/CPU models, memory) were mentioned for running experiments.
Software Dependencies	No	No specific software dependencies with version numbers were mentioned.
Experiment Setup	No	The paper states, 'Further details on the experimental setup are provided in the appendix (see Section 6.2).', but these details are not present in the provided main text.