Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift

Authors: Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, Jasper Snoek

NeurIPS 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration." |
| Researcher Affiliation | Industry | Yaniv Ovadia (Google Research, yovadia@google.com); Emily Fertig (Google Research, emilyaf@google.com); Jie Ren (Google Research, jjren@google.com); Zachary Nado (Google Research, znado@google.com); D. Sculley (Google Research, dsculley@google.com); Sebastian Nowozin (Google Research, nowozin@google.com); Joshua V. Dillon (Google Research, jvdillon@google.com); Balaji Lakshminarayanan (DeepMind, balajiln@google.com); Jasper Snoek (Google Research, jsnoek@google.com) |
| Pseudocode | No | No pseudocode or algorithm blocks found in the paper. |
| Open Source Code | Yes | "In addition to answering the questions above, our code is made available open-source along with our model predictions such that researchers can easily evaluate their approaches on these benchmarks." Code: https://github.com/google-research/google-research/tree/master/uq_benchmark_2019 |
| Open Datasets | Yes | "We evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets across three different modalities: images, text and categorical (online ad) data. For each we follow standard training, validation and testing protocols..." Datasets used: MNIST; CIFAR-10 (Krizhevsky, 2009); ImageNet (Deng et al., 2009); SVHN (Netzer et al., 2011); 20 Newsgroups (Lang, 1995); One Billion Word Benchmark (LM1B) (Chelba et al., 2013); Criteo Display Advertising Challenge. |
| Dataset Splits | No | "For each we follow standard training, validation and testing protocols, but we additionally evaluate results on increasingly shifted data and an OOD dataset." The paper mentions "standard training, validation and testing protocols" but provides neither specific percentages or counts for these splits nor citations of exact predefined splits with authors and year. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory amounts, or other machine specifications) are provided for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers such as Adam and Adagrad, but does not provide version numbers for the software dependencies or libraries used. |
| Experiment Setup | Yes | "Hyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017) (except on ImageNet) as detailed in Appendix A.8. We detail the models and implementations used in Appendix A." |