Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Revisiting Logit Distributions for Reliable Out-of-Distribution Detection

Authors: Jiachen Liang, RuiBing Hou, Minyang Hu, Hong Chang, Shiguang Shan, Xilin Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks.
Researcher Affiliation	Academia	1 State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China 2 University of Chinese Academy of Sciences (CAS), China EMAIL, EMAIL
Pseudocode	No	The paper describes the methodology using prose and mathematical formulas, including theorems and proof sketches, but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/GIT-LJc/LogitGap.
Open Datasets	Yes	We evaluate the effectiveness of LogitGap across multiple OOD detection benchmarks. Specifically, we use ImageNet [7] or ImageNet-100 as ID datasets, while NINCO [2], ImageNet-OOD [55], and ImageNet-O [19] are used as OOD datasets. ... we use CIFAR-10 [24] as ID dataset and adopt the standard OpenOOD benchmark splits for OOD evaluation. The OOD benchmarks include both near-OOD datasets: CIFAR-100, Tiny ImageNet, and far-OOD datasets: MNIST [8], SVHN [36], Texture [5], and Places365 [61].
Dataset Splits	Yes	We use three standard metrics commonly used in OOD detection literature: (1) False Positive Rate (FPR95): Measures the probability that an OOD sample is misclassified as ID when the true positive rate of ID samples is fixed at 95%. ... To this end, we firstly construct a small validation set randomly sampled from the in-distribution (ID) data, with a fixed size of 100 samples. ... Following the OpenOOD protocol [54], we use CIFAR-10 [24] as ID dataset and adopt the standard OpenOOD benchmark splits for OOD evaluation.
Hardware Specification	Yes	We run all OOD detection experiments on NVIDIA GeForce RTX-4090Ti GPUs with Pytorch 2.3.1.
Software Dependencies	Yes	We run all OOD detection experiments on NVIDIA GeForce RTX-4090Ti GPUs with Pytorch 2.3.1.
Experiment Setup	Yes	For datasets with a large number of classes (e.g., ImageNet and ImageNet-100), we set N to 20% of the total number of classes K. In contrast, for datasets with fewer classes (e.g., ImageNet-10 and ImageNet-20), we set N to 50% of K. ... For datasets with a large number of categories, such as ImageNet [7] and ImageNet-100 [33], we set the interpolation parameters to α = 0.3 and β = 0.8. For smaller-scale datasets like ImageNet-10 [33] and ImageNet-20 [33], we set α = 0.3 and β = 0.0.
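
The FPR95 metric quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `fpr_at_tpr` is hypothetical, and the sketch assumes that a higher score means "more in-distribution" and that ties at the threshold count as accepted.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that
    retains at least `tpr` of the ID samples.

    Assumption (not from the paper): higher score = more ID-like.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")

    # Rank ID scores from most to least ID-like.
    ranked = sorted(id_scores, reverse=True)

    # Smallest number of top-ranked ID samples that reaches the target TPR.
    # The small tolerance guards against floating-point round-up in tpr * n.
    keep = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[keep - 1]

    # An OOD sample is a false positive if it scores at or above the threshold.
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)
```

With 20 ID scores and `tpr=0.95`, the threshold is the 19th-highest ID score, so exactly one ID sample is rejected; the returned value is the share of OOD scores that still clear that threshold. Lower is better.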