Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Geometric Explanation of the Likelihood OOD Detection Paradox

Authors: Hamidreza Kamkari, Brendan Leigh Ross, Jesse C. Cresswell, Anthony L. Caterini, Rahul Krishnan, Gabriel Loaiza-Ganem

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments, Setup: We compare datasets within two classes: (i) 28 × 28 greyscale images, including FMNIST, MNIST, Omniglot (Lake et al., 2015), and EMNIST (Cohen et al., 2017); and (ii) RGB images resized to 32 × 32 × 3, comprising SVHN, CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), and a simplified, cropped version of CelebA (Kist, 2021). We give experimental details on model training in Appendix D.1 and Appendix D.2.
Researcher Affiliation | Collaboration | ¹Layer 6 AI, ²University of Toronto, ³Vector Institute. Correspondence to: Hamidreza Kamkari, Brendan Leigh Ross, Jesse C. Cresswell, Anthony L. Caterini, Gabriel Loaiza-Ganem <EMAIL>, Rahul G. Krishnan <EMAIL>.
Pseudocode | Yes | Algorithm 1: Dual threshold OOD detection, returns True if x is deemed OOD, and False if deemed in-distribution.
Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/dgm_ood_detection.
Open Datasets | Yes | We compare datasets within two classes: (i) 28 × 28 greyscale images, including FMNIST (Xiao et al., 2017), MNIST (LeCun et al., 1998), Omniglot (Lake et al., 2015), and EMNIST (Cohen et al., 2017); and (ii) RGB images resized to 32 × 32 × 3, comprising SVHN (Netzer et al., 2011), CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009), Tiny ImageNet (Le & Yang, 2015), and a simplified, cropped version of CelebA (Kist, 2021).
Dataset Splits | No | The paper mentions 'training data' and 'test data' in various contexts (e.g., 'A-train', 'A-test'), but no specific 'validation' split percentages, sample counts, or explicit methodology for validation set partitioning are provided.
Hardware Specification | Yes | We used an NVIDIA Tesla V100 SXM2 with 7 hours of GPU time to train each of the models.
Software Dependencies | No | The paper mentions software components like 'diffusers library' and optimizers 'Adam'/'AdamW', but does not provide specific version numbers for any programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We trained both Glow (Kingma & Dhariwal, 2018) and RQ-NSFs (Durkan et al., 2019) on our datasets, with the hyperparameters detailed in Table 3.
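
The Pseudocode row above describes Algorithm 1 only by its return contract (True for OOD, False for in-distribution). A minimal sketch of a dual-threshold check with that contract might look like the following. Everything beyond the return contract is an assumption made for illustration: this page does not say which two scores are thresholded, so the sketch assumes a log-likelihood and a local intrinsic dimension (LID) estimate, and the function name, parameter names, and threshold values are hypothetical. The released repository should be consulted for the authors' actual procedure.

```python
def is_ood(log_likelihood: float,
           lid_estimate: float,
           likelihood_threshold: float,
           lid_threshold: float) -> bool:
    """Return True if the input is deemed OOD, False if in-distribution.

    Assumption (not stated on this page): the two scores are a model
    log-likelihood and a local intrinsic dimension (LID) estimate, and an
    input counts as in-distribution only when both clear their thresholds.
    """
    # First threshold: a low likelihood alone is enough to flag the input.
    if log_likelihood < likelihood_threshold:
        return True
    # Second threshold: a high-likelihood input is still flagged when its
    # second score (here, the assumed LID estimate) falls below its threshold.
    if lid_estimate < lid_threshold:
        return True
    return False


# Hypothetical scores and thresholds, chosen only to exercise each branch.
flags = [
    is_ood(-5000.0, 40.0, likelihood_threshold=-3000.0, lid_threshold=20.0),
    is_ood(-1000.0, 5.0, likelihood_threshold=-3000.0, lid_threshold=20.0),
    is_ood(-1000.0, 40.0, likelihood_threshold=-3000.0, lid_threshold=20.0),
]
```

In practice the two thresholds would be tuned on held-out in-distribution data; how that is done is not described on this page.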