Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Geometric Explanation of the Likelihood OOD Detection Paradox
Authors: Hamidreza Kamkari, Brendan Leigh Ross, Jesse C. Cresswell, Anthony L. Caterini, Rahul Krishnan, Gabriel Loaiza-Ganem
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments, Setup: We compare datasets within two classes: (i) 28 × 28 greyscale images, including FMNIST, MNIST, Omniglot (Lake et al., 2015), and EMNIST (Cohen et al., 2017); and (ii) RGB images resized to 32 × 32 × 3, comprising SVHN, CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), and a simplified, cropped version of CelebA (Kist, 2021). We give experimental details on model training in Appendix D.1 and Appendix D.2. |
| Researcher Affiliation | Collaboration | ¹Layer 6 AI, ²University of Toronto, ³Vector Institute. Correspondence to: Hamidreza Kamkari, Brendan Leigh Ross, Jesse C. Cresswell, Anthony L. Caterini, Gabriel Loaiza-Ganem <EMAIL>, Rahul G. Krishnan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Dual threshold OOD detection, returns True if x is deemed OOD, and False if deemed in-distribution. |
| Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/dgm_ood_detection. |
| Open Datasets | Yes | We compare datasets within two classes: (i) 28 × 28 greyscale images, including FMNIST (Xiao et al., 2017), MNIST (LeCun et al., 1998), Omniglot (Lake et al., 2015), and EMNIST (Cohen et al., 2017); and (ii) RGB images resized to 32 × 32 × 3, comprising SVHN (Netzer et al., 2011), CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), and a simplified, cropped version of CelebA (Kist, 2021). |
| Dataset Splits | No | The paper mentions 'training data' and 'test data' in various contexts (e.g., 'A-train', 'A-test'), but no specific 'validation' split percentages, sample counts, or explicit methodology for validation set partitioning are provided. |
| Hardware Specification | Yes | We used an NVIDIA Tesla V100 SXM2 with 7 hours of GPU time to train each of the models. |
| Software Dependencies | No | The paper mentions software components like 'diffusers library' and optimizers 'Adam'/'AdamW', but does not provide specific version numbers for any programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We trained both Glow (Kingma & Dhariwal, 2018) and RQ-NSFs (Durkan et al., 2019) on our datasets, with the hyperparameters detailed in Table 3. |
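
The Pseudocode row above only names Algorithm 1 and its return convention (True for OOD, False for in-distribution); the table does not spell out the two thresholds. As a rough illustration of what a dual-threshold rule of that shape could look like, the sketch below assumes two precomputed scalar scores per datapoint: a log-likelihood and a second score (taken here to be a local intrinsic dimension estimate, which is an assumption not stated in this table). The function name, parameter names, threshold values, and comparison directions are all hypothetical and are not taken from the authors' repository.

```python
def dual_threshold_is_ood(log_likelihood: float,
                          second_score: float,
                          likelihood_threshold: float,
                          second_threshold: float) -> bool:
    """Hypothetical dual-threshold rule: True if deemed OOD, False otherwise.

    Assumption: a point is flagged as OOD when EITHER score falls below
    its threshold. Both the scores and the thresholds are placeholders.
    """
    # First check: the model assigns the point a low likelihood.
    if log_likelihood < likelihood_threshold:
        return True
    # Second check: likelihood is high, but the second score
    # (assumed to be an intrinsic-dimension estimate) is low.
    if second_score < second_threshold:
        return True
    # Both scores clear their thresholds: treat as in-distribution.
    return False


# Illustrative values only; they do not come from the paper.
print(dual_threshold_is_ood(-5000.0, 12.0, -3000.0, 5.0))
print(dual_threshold_is_ood(-1000.0, 2.0, -3000.0, 5.0))
print(dual_threshold_is_ood(-1000.0, 12.0, -3000.0, 5.0))
```

The point of two thresholds in a rule like this is that a single likelihood cutoff cannot flag points that receive high likelihood yet are still out-of-distribution; the second check is what catches those. The paper and the linked repository remain the reference for the actual algorithm.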