Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift
Authors: Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, Jasper Snoek
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. |
| Researcher Affiliation | Industry | Yaniv Ovadia Google Research EMAIL Emily Fertig Google Research EMAIL Jie Ren Google Research EMAIL Zachary Nado Google Research EMAIL D. Sculley Google Research EMAIL Sebastian Nowozin Google Research EMAIL Joshua V. Dillon Google Research EMAIL Balaji Lakshminarayanan DeepMind EMAIL Jasper Snoek Google Research EMAIL |
| Pseudocode | No | No pseudocode or algorithm blocks found in the paper. |
| Open Source Code | Yes | In addition to answering the questions above, our code is made available open-source along with our model predictions such that researchers can easily evaluate their approaches on these benchmarks 4. 4https://github.com/google-research/google-research/tree/master/uq_benchmark_2019 |
| Open Datasets | Yes | We evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets across three different modalities: images, text and categorical (online ad) data. For each we follow standard training, validation and testing protocols... MNIST dataset... CIFAR-10 (Krizhevsky, 2009)... ImageNet (Deng et al., 2009)... SVHN dataset (Netzer et al., 2011)... 20newsgroups dataset (Lang, 1995)... One Billion Word Benchmark (LM1B) (Chelba et al., 2013)... Criteo Display Advertising Challenge |
| Dataset Splits | No | For each we follow standard training, validation and testing protocols, but we additionally evaluate results on increasingly shifted data and an OOD dataset. The paper mentions 'standard training, validation and testing protocols' but does not provide specific percentages or counts for these splits, nor does it cite the exact predefined splits with authors and year. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory amounts, or detailed computer specifications) are provided for running experiments. |
| Software Dependencies | No | The paper mentions optimizers such as Adam and Adagrad, but does not provide version numbers for the software libraries or frameworks used. |
| Experiment Setup | Yes | Hyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017) (except on Image Net) as detailed in Appendix A.8. We detail the models and implementations used in Appendix A. |
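The benchmarked paper evaluates how calibration degrades under dataset shift. One of its core metrics is the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average confidence and average accuracy is weighted by bin size. The sketch below is an illustrative NumPy implementation of this standard metric, not code from the paper's repository; the function name and the equal-width binning scheme are our assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Expected Calibration Error: the bin-weighted average gap between
    mean confidence and mean accuracy over equal-width confidence bins.

    probs:  (N, K) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    confidences = probs.max(axis=1)          # top-class confidence per example
    predictions = probs.argmax(axis=1)       # predicted class per example
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - confidence| in this bin, weighted by bin fraction
            ece += mask.mean() * abs(accuracies[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

To reproduce the paper's shift experiments, one would compute this metric on each shifted test set (e.g. increasing corruption intensities) and plot ECE against shift level; a well-calibrated model keeps ECE low even as accuracy drops.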