Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift
Authors: Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, Jasper Snoek
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. |
| Researcher Affiliation | Industry | Yaniv Ovadia (Google Research, yovadia@google.com); Emily Fertig (Google Research, emilyaf@google.com); Jie Ren (Google Research, jjren@google.com); Zachary Nado (Google Research, znado@google.com); D. Sculley (Google Research, dsculley@google.com); Sebastian Nowozin (Google Research, nowozin@google.com); Joshua V. Dillon (Google Research, jvdillon@google.com); Balaji Lakshminarayanan (DeepMind, balajiln@google.com); Jasper Snoek (Google Research, jsnoek@google.com) |
| Pseudocode | No | No pseudocode or algorithm blocks found in the paper. |
| Open Source Code | Yes | In addition to answering the questions above, our code is made available open-source along with our model predictions such that researchers can easily evaluate their approaches on these benchmarks 4. 4https://github.com/google-research/google-research/tree/master/uq_benchmark_2019 |
| Open Datasets | Yes | We evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets across three different modalities: images, text and categorical (online ad) data. For each we follow standard training, validation and testing protocols... MNIST dataset... CIFAR-10 (Krizhevsky, 2009)... ImageNet (Deng et al., 2009)... SVHN dataset (Netzer et al., 2011)... 20 Newsgroups dataset (Lang, 1995)... One Billion Word Benchmark (LM1B) (Chelba et al., 2013)... Criteo Display Advertising Challenge |
| Dataset Splits | No | For each we follow standard training, validation and testing protocols, but we additionally evaluate results on increasingly shifted data and an OOD dataset. The paper refers to 'standard training, validation and testing protocols' but does not report specific split percentages or counts, nor does it cite the exact predefined splits. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory amounts, or detailed computer specifications) are provided for running experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam and Adagrad, but does not provide specific version numbers for software dependencies or libraries used. |
| Experiment Setup | Yes | Hyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017) (except on ImageNet) as detailed in Appendix A.8. We detail the models and implementations used in Appendix A. |
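The benchmark above evaluates calibration under dataset shift. A standard calibration metric used in this line of work is the expected calibration error (ECE): predictions are binned by confidence and the per-bin gap between accuracy and mean confidence is averaged. The following is a minimal NumPy sketch of ECE for illustration; the function name and binning scheme are our own, not taken from the paper's released code.

```python
import numpy as np

def expected_calibration_error(probs, labels, num_bins=10):
    """ECE sketch: bin predictions by confidence, then take the
    bin-weighted average of |accuracy - mean confidence| per bin.

    probs: (N, K) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's calibration gap by the fraction of samples it holds.
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated one-hot predictions yield ECE = 0.
probs = np.eye(3)[[0, 1, 2]]
labels = np.array([0, 1, 2])
print(expected_calibration_error(probs, labels))  # 0.0
```

Under dataset shift, accuracy typically drops while confidence often does not, so this gap (and hence ECE) tends to grow, which is the degradation the benchmark is designed to measure.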