Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection

Authors: Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform thorough quantitative and qualitative analyses to evaluate the effectiveness of GOOD, demonstrating that training with samples generated by GOOD can notably enhance OOD detection performance. We conduct extensive experiments on multiple benchmarks, including ImageNet, CIFAR-100, CIFAR-10, to evaluate the effectiveness of GOOD. Results show that incorporating GOOD-generated samples into OE training significantly improves detection performance, outperforming existing post-hoc and synthesized outlier-based methods.
Researcher Affiliation	Academia	1Nanjing University 2Fudan University 3Carnegie Mellon University 4Chinese Academy of Sciences 5Nanyang Technological University
Pseudocode	Yes	Algorithm 1 Training-Free Guidance for OOD Sampling (GOOD)
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We plan to release code and instructions upon publication to support full reproducibility. Details for reproducing results are described in Section 4 and the Appendix.
Open Datasets	Yes	We use ImageNet-100 [10] and CIFAR-100 as the ID training datasets, following [15, 13, 44, 7]. For ImageNet-100, the OOD test data is sourced from [32], which includes iNaturalist [63], SUN [66], Places [74], and Textures [9]. For CIFAR-100, we evaluate on five OOD test sets: SVHN [49], Places365 [74], LSUN-R [72], ISUN [67], and Textures [9].
Dataset Splits	Yes	For ImageNet-100, the OOD test data is sourced from [32], which includes iNaturalist [63], SUN [66], Places [74], and Textures [9]. For CIFAR-100, we evaluate on five OOD test sets: SVHN [49], Places365 [74], LSUN-R [72], ISUN [67], and Textures [9]. Next, we study the semantic effect of guidance using a 32×32 DDPM model trained on CIFAR-10. We partition the 10 classes into two halves: five seen classes are used to train a classifier, while the other five are held out.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Section 4.2 mentions the use of Stable Diffusion and ResNet classifiers, and training time (e.g., 50/200 epochs). Specific hardware specs (e.g., GPU types) can be included in the supplementary material.
Software Dependencies	Yes	To improve the classifier's ability to differentiate between OOD and ID data, we generate OOD samples using Stable Diffusion v1.5 [52] guided by our proposed method.
Experiment Setup	Yes	The loss weighting parameter λ is 2.5. Optimization is performed using stochastic gradient descent with momentum (0.9) and a weight decay of 5 × 10⁻⁴. The model is trained for 50 epochs on ImageNet-100 and 200 epochs on CIFAR-100, with a cosine learning rate schedule that starts with an initial learning rate of 10⁻⁴ and a batch size of 160. We select γ based on resolution: 0.1 for low-resolution CIFAR and 0.001 for high-resolution ImageNet.
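
For readers trying to reuse the Experiment Setup values, the following is a minimal sketch of how the quoted hyperparameters might be collected and how a cosine learning rate schedule could be computed per epoch. It is not the authors' code. The exponents are read as 5e-4 and 1e-4 (the minus signs are missing from the extracted text), and the function name `cosine_lr`, the dictionary layout, and the final learning rate of 0 are assumptions made for illustration; the paper excerpt states only that the schedule is cosine and gives the initial rate.

```python
import math

# Values quoted in the Experiment Setup row (exponents read as negative).
SETUP = {
    "loss_weight_lambda": 2.5,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "initial_lr": 1e-4,
    "batch_size": 160,
    "epochs": {"ImageNet-100": 50, "CIFAR-100": 200},
    "gamma": {"CIFAR": 0.1, "ImageNet": 0.001},
}


def cosine_lr(epoch, total_epochs, initial_lr, final_lr=0.0):
    """Half-cosine decay from initial_lr to final_lr.

    final_lr=0.0 is an assumption; the source does not state the floor.
    """
    progress = epoch / total_epochs
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine


if __name__ == "__main__":
    # Print the learning rate at the start, midpoint, and end of training.
    for dataset, total in SETUP["epochs"].items():
        for epoch in (0, total // 2, total):
            lr = cosine_lr(epoch, total, SETUP["initial_lr"])
            print(f"{dataset} epoch {epoch}: lr = {lr:.2e}")
```

Under these assumptions the rate starts at 1e-4, passes through half that value at the midpoint, and reaches the assumed floor at the final epoch. A training framework's built-in cosine scheduler would normally replace the hand-written function.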