Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Leaving No OOD Instance Behind: Instance-Level OOD Fine-Tuning for Anomaly Segmentation

Authors: Yuxuan Zhang, Zhenbo Shi, Han Ye, Shuchang Wang, Zhidong Yu, Shaowei Wang, Wei Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results show that integrating LNOIB into various OOD fine-tuning strategies yields significant improvements, particularly in component-level results, highlighting its strength in comprehensive anomaly segmentation. We evaluate our approach on various AS benchmarks.
Researcher Affiliation	Academia	1 School of Computer Science and Technology, University of Science and Technology of China; 2 Suzhou Institute for Advanced Research, University of Science and Technology of China; 3 Hefei National Laboratory, University of Science and Technology of China; 4 Institute of Artificial Intelligence and Blockchain, Guangzhou University
Pseudocode	Yes	Justification: We have described the details that allow us to reproduce the results in Section 4 and supplementary material. As LNOIB is a versatile mechanisms for current OOD fine-tuning approaches, we give the pseudo code of how to incorporate LNOIB into EM. Moreover, all of our source code will be publicly available upon publication.
Open Source Code	Yes	More details will be available at: https://github.com/yuxuan357/LNOIB
Open Datasets	Yes	As to the ID dataset, we adopt the Cityscapes dataset [9] for pre-training... For OOD datasets, we evaluate our approach on various AS benchmarks. The Fishyscapes benchmark [1] includes two datasets: Fishyscapes Static (FS Static) and Fishyscapes Lost & Found (FS L&F). ...SMIYC benchmark [3] consists of two separate datasets: Road Anomaly (SMIYC-RA) and Road Obstacle (SMIYC-RO)... Additionally, the Road Anomaly dataset [30]... We adopt Anomaly Mix [47] to sample 297 images from COCO [28] and mix them into Cityscapes to generate outlier images...
Dataset Splits	Yes	As to the ID dataset, we adopt the Cityscapes dataset [9] for pre-training, which includes 2975 training and 500 validation images, containing 19 different urban scene categories. For OOD datasets, we evaluate our approach on various AS benchmarks. The Fishyscapes benchmark [1] includes two datasets: Fishyscapes Static (FS Static) and Fishyscapes Lost & Found (FS L&F). The former contains 30 validation images from blending Pascal [12], and the latter is based on Lost and Found dataset [39], with 100 validation images. SMIYC benchmark [3] consists of two separate datasets: Road Anomaly (SMIYC-RA) and Road Obstacle (SMIYC-RO), which contain 10 and 30 validation images with road anomalies and obstacles, respectively. Additionally, the Road Anomaly dataset [30], which served as a precursor to SMIYC, includes 60 images with anomalies located in or near the road for validation.
Hardware Specification	Yes	As to the hardware, we use a server running Ubuntu 22.04, equipped with 4 RTX 3090Ti GPUs, each with 24 GB of memory, as well as another server with 2 NVIDIA A100 GPUs, each with 80 GB of memory.
Software Dependencies	No	We adopt the Pytorch framework to conduct the training and evaluation process.
Experiment Setup	Yes	During the OOD fine-tuning stage, our LNOIB objectives are built on the global-level losses used in EM, PEBAL, and M2A, respectively, with parameters set as α = 0.5, β = 0.5, M = 1, τ = 0.7, and γ1 = γ2 = 1, empirically. For detailed formulations of LNOIB objective, please refer to Appendix B. We select features from stages 2, 3, and 4 of each backbone, upsample them to 1/4 of the image size, and incorporate them to calculate L_feat. For consistency, we fine-tune the whole segmentation model for each method using their corresponding configurations.
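
The values quoted in the Experiment Setup row can be collected into a small configuration object for anyone attempting a re-run. The following is a minimal sketch, not code from the LNOIB repository: the class name `LNOIBSetup`, its field names, and the helper `upsample_target` are hypothetical, the role of each parameter is defined only in the paper's Appendix B, and "1/4 of the image size" is assumed here to mean one quarter of each spatial dimension.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LNOIBSetup:
    """Hyperparameters as quoted in the Experiment Setup row.

    Field names are hypothetical; the meaning of each symbol is
    given in Appendix B of the paper, not on this page.
    """
    alpha: float = 0.5   # α
    beta: float = 0.5    # β
    M: float = 1.0       # M
    tau: float = 0.7     # τ
    gamma1: float = 1.0  # γ1
    gamma2: float = 1.0  # γ2
    # Backbone stages whose features enter L_feat.
    feature_stages: Tuple[int, ...] = (2, 3, 4)
    # Assumption: the ratio applies to height and width separately.
    upsample_ratio: float = 0.25


def upsample_target(height: int, width: int, ratio: float = 0.25) -> Tuple[int, int]:
    """Return the (height, width) that features would be resized to."""
    return int(height * ratio), int(width * ratio)


if __name__ == "__main__":
    cfg = LNOIBSetup()
    # Example input resolution chosen for illustration only.
    print(cfg)
    print(upsample_target(1024, 2048, cfg.upsample_ratio))
```

Under the per-dimension assumption, a 1024 x 2048 input gives a target feature map of 256 x 512. If the paper intends a different reading of "1/4 of the image size", only `upsample_target` needs to change.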