Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Revisiting Deep Hybrid Models for Out-of-Distribution Detection

Authors: Paul-Ruben Schlumbom, Eibe Frank

TMLR 2025

Reproducibility variables (result, followed by the supporting LLM response):

Research Type: Experimental
"As there are no implementations available, we set out to reproduce the approach by carefully filling in gaps in the description of the algorithm. Although we were unable to attain 100% OOD detection rates, and our results indicate that such performance is impossible on the CIFAR-10 benchmark, we achieved good OOD performance. We provide a detailed analysis of when the architecture fails and argue that it introduces an adversarial relationship between the classification component and the density estimator, rendering it highly sensitive to the balance of these two components and yielding a collapsed feature space without careful fine-tuning."

Researcher Affiliation: Academia
"Paul-Ruben Schlumbom (EMAIL), Department of Computer Science, University of Waikato; Eibe Frank (EMAIL), Department of Computer Science, University of Waikato"

Pseudocode: No
The paper describes methods and architectures in prose and refers to figures, but no explicit pseudocode or algorithm blocks are provided.

Open Source Code: Yes
"Our implementation of DHMs is publicly available." (footnote: https://github.com/P-Schlumbom/deep-hybrid-models)

Open Datasets: Yes
"The CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009)... Similarly, the SVHN dataset (Netzer et al., 2011) is often employed as benchmark OOD data for CIFAR-10."

Dataset Splits: Yes
"Following the training regime described by Cao & Zhang (2022), we train the DHM model on CIFAR-10 training data and then evaluate the OOD detection performance on CIFAR-100 and SVHN test sets compared to the CIFAR-10 test set. This is done by computing the density assigned to each image by the DHM's normalising flow and then computing the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR-in), where we treat ID samples as the positive class and OOD samples as the negative class. This is a standard evaluation procedure in the OOD detection literature; see Hendrycks & Gimpel (2016)."
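The evaluation described above can be sketched in a few lines. This is an illustrative example, not the authors' code: AUROC with ID samples as the positive class equals the probability that a randomly chosen ID image receives a higher density score than a randomly chosen OOD image (the Mann-Whitney U statistic). The log-density values below are hypothetical placeholders for scores a normalising flow might assign.

```python
# Illustrative sketch (not the authors' implementation): AUROC for OOD
# detection from per-image density scores, treating in-distribution (ID)
# samples as the positive class.

def auroc(id_scores, ood_scores):
    """P(score_id > score_ood) over all ID/OOD pairs; ties count as 0.5."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Hypothetical log-densities: ID images should generally score higher.
id_log_densities = [-1.2, -0.8, -1.5, -0.9]
ood_log_densities = [-3.1, -2.7, -1.1, -4.0]
print(auroc(id_log_densities, ood_log_densities))  # 0.875
```

The quadratic pairwise loop is only for clarity; a rank-based formulation (or `sklearn.metrics.roc_auc_score`) computes the same quantity efficiently on full test sets.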
Hardware Specification: Yes
"When training for 200 epochs on an NVIDIA RTX 2080 Ti, this DHM configuration could be trained in about 17 and a half hours."

Software Dependencies: No
The paper mentions optimizers like SGD and Adam, and architectures like WRN-28-10, but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).

Experiment Setup: Yes
"The feature extractor and classifier head are trained with SGD, using Nesterov momentum with a momentum of 0.9, an initial learning rate of 0.05, and a weight decay rate of 5e-4. The learning rate is scaled by 0.2 at 60, 120, and 160 epochs, and the model (unless specified otherwise) is trained for 200 epochs with a batch size of 256. The spectral normalisation coefficient c is set to 3.0. The residual flow component is trained using Adam optimisation with a learning rate of 1e-4 and a weight decay rate of 16e-4."
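The reported step schedule (initial rate 0.05, multiplied by 0.2 at epochs 60, 120, and 160) can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not name a framework, and 0-indexed epochs are assumed since the indexing convention is not specified; in PyTorch the equivalent would be `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.2`.

```python
# Sketch of the reported learning-rate schedule for the feature extractor
# and classifier head (assumption: epochs are 0-indexed; the report does
# not specify the convention or the framework used).

INITIAL_LR = 0.05
MILESTONES = (60, 120, 160)
DECAY = 0.2

def learning_rate(epoch):
    """Learning rate in effect during the given epoch."""
    lr = INITIAL_LR
    for milestone in MILESTONES:
        if epoch >= milestone:
            lr *= DECAY
    return lr

for epoch in (0, 59, 60, 120, 160, 199):
    print(epoch, learning_rate(epoch))
```

Over the 200-epoch run this yields rates of 0.05, 0.01, 0.002, and finally 0.0004 after epoch 160.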