Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Likelihood Ratios for Out-of-Distribution Detection
Authors: Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design experiments on multiple data modalities (images, genomic sequences) to evaluate our method and compare with other baseline methods. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. |
| Researcher Affiliation | Industry | Jie Ren (Google Research), Peter J. Liu (Google Research), Emily Fertig (Google Research), Jasper Snoek (Google Research), Ryan Poplin (Google Research), Mark A. DePristo (Google Research), Joshua V. Dillon (Google Research), Balaji Lakshminarayanan (DeepMind) |
| Pseudocode | Yes | See Algorithm 1 in Appendix A for the pseudocode for generating input perturbations. The pseudocode for our proposed OOD detection algorithm can be found in Algorithm 2 in Appendix A. |
| Open Source Code | Yes | The dataset and code for the genomics study is available at https://github.com/google-research/google-research/tree/master/genomics_ood. |
| Open Datasets | Yes | We design a new dataset for evaluating OOD methods. The dataset and code for the genomics study is available at https://github.com/google-research/google-research/tree/master/genomics_ood. (a) Fashion-MNIST as in-distribution and MNIST as OOD, (b) CIFAR-10 as in-distribution and SVHN as OOD. |
| Dataset Splits | Yes | We choose two cutoff years, 2011 and 2016, to define the training, validation, and test splits (Figure 4). Our dataset consists of 10 in-distribution classes, 60 OOD classes for validation, and 60 OOD classes for testing. We trained the model using only in-distribution inputs, and we tuned the hyperparameters using validation datasets that include both in-distribution and OOD inputs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions models like PixelCNN++ and LSTM, and acknowledges the Google TensorFlow Probability team, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The rate µ is a hyperparameter and can be easily tuned using a small validation OOD dataset (different from the actual OOD dataset of interest). In the case where a validation OOD dataset is not available, we show that µ can also be tuned using simulated OOD data. In practice, we observe that µ ∈ [0.1, 0.2] achieves good performance empirically for most of the experiments in our paper. Besides adding perturbations to the input data, we found that other techniques which improve model generalization and prevent memorization, such as adding L2 regularization with coefficient λ to the model weights, can help to train a good background model. (A code sketch of this setup follows the table.) |
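For context on the setup quoted above, here is a minimal sketch of the paper's two ingredients: perturbing inputs at rate µ to train a background model, and scoring inputs by the likelihood ratio between the full and background models. The function names (`perturb`, `llr_score`) and the `log_prob` model interface are hypothetical stand-ins for whatever trained density models (e.g., the paper's LSTM over genomic sequences) are available; only the perturbation scheme and the ratio itself follow the paper.

```python
# Sketch of likelihood-ratio OOD scoring, assuming models expose `log_prob`.
import numpy as np

VOCAB = np.array(list("ACGT"))  # nucleotide alphabet for the genomics task


def perturb(seq: np.ndarray, mu: float, rng: np.random.Generator) -> np.ndarray:
    """Independently replace each position with a uniform random symbol
    with probability mu; perturbed data is used to train the background model."""
    mask = rng.random(seq.shape) < mu
    noise = rng.choice(VOCAB, size=seq.shape)
    return np.where(mask, noise, seq)


def llr_score(x, full_model, background_model) -> float:
    """Likelihood-ratio score: log p(x | full) - log p(x | background).
    Higher values indicate in-distribution; threshold to flag OOD inputs."""
    return full_model.log_prob(x) - background_model.log_prob(x)


# Example: generate a perturbed training input for the background model,
# using a rate in the paper's empirically recommended range µ ∈ [0.1, 0.2].
rng = np.random.default_rng(0)
x = np.array(list("ACGTACGTAA"))
x_tilde = perturb(x, mu=0.15, rng=rng)
```

Per the quoted setup, the background model is additionally trained with L2 regularization (coefficient λ) so that it captures background statistics rather than memorizing the in-distribution data; the OOD decision then comes from thresholding `llr_score`.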