Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift
Authors: Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, Jasper Snoek
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. |
| Researcher Affiliation | Industry | Yaniv Ovadia Google Research EMAIL Emily Fertig Google Research EMAIL Jie Ren Google Research EMAIL Zachary Nado Google Research EMAIL D. Sculley Google Research EMAIL Sebastian Nowozin Google Research EMAIL Joshua V. Dillon Google Research EMAIL Balaji Lakshminarayanan DeepMind EMAIL Jasper Snoek Google Research EMAIL |
| Pseudocode | No | No pseudocode or algorithm blocks found in the paper. |
| Open Source Code | Yes | In addition to answering the questions above, our code is made available open-source along with our model predictions such that researchers can easily evaluate their approaches on these benchmarks 4. 4https://github.com/google-research/google-research/tree/master/uq_benchmark_2019 |
| Open Datasets | Yes | We evaluate the behavior of the predictive uncertainty of deep learning models on a variety of datasets across three different modalities: images, text and categorical (online ad) data. For each we follow standard training, validation and testing protocols... MNIST dataset... CIFAR-10 (Krizhevsky, 2009)... ImageNet (Deng et al., 2009)... SVHN dataset (Netzer et al., 2011)... 20newsgroups dataset (Lang, 1995)... One Billion Word Benchmark (LM1B) (Chelba et al., 2013)... Criteo Display Advertising Challenge |
| Dataset Splits | No | For each we follow standard training, validation and testing protocols, but we additionally evaluate results on increasingly shifted data and an OOD dataset. The paper mentions 'standard training, validation and testing protocols' but does not provide specific percentages or counts for these splits, nor does it cite the exact predefined splits with authors and year. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory amounts, or detailed computer specifications) are provided for running experiments. |
| Software Dependencies | No | The paper mentions optimizers such as Adam and Adagrad, but does not provide version numbers for the software libraries or frameworks used. |
| Experiment Setup | Yes | Hyperparameters were tuned for all methods using Bayesian optimization (Golovin et al., 2017) (except on Image Net) as detailed in Appendix A.8. We detail the models and implementations used in Appendix A. |
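The benchmarked paper evaluates how calibration degrades under dataset shift. One of its core metrics is the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average confidence and average accuracy is weighted by bin size. The sketch below is an illustrative NumPy implementation of this standard metric, not code from the paper's repository; the function name and the equal-width binning scheme are our assumptions.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Expected Calibration Error: the bin-weighted average gap between
    mean confidence and mean accuracy over equal-width confidence bins.

    probs:  (N, K) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    confidences = probs.max(axis=1)          # top-class confidence per example
    predictions = probs.argmax(axis=1)       # predicted class per example
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |accuracy - confidence| in this bin, weighted by bin fraction
            ece += mask.mean() * abs(accuracies[mask].mean()
                                     - confidences[mask].mean())
    return ece
```

To reproduce the paper's shift experiments, one would compute this metric on each shifted test set (e.g. increasing corruption intensities) and plot ECE against shift level; a well-calibrated model keeps ECE low even as accuracy drops.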