Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Out-of-Distribution Detection with Relative Angles

Authors: Berker Demirel, Marco Fumero, Francesco Locatello

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we first test ORA on a large-scale ImageNet OOD benchmark [Deng et al., 2009] that spans nine models including ConvNeXt [Liu et al., 2022], Swin [Liu et al., 2021], DeiT [Touvron et al., 2021] and EVA [Fang et al., 2023] to establish the performance beyond the usual ResNet-50 [He et al., 2016] setting. We then demonstrate how ORA benefits from contrastively learned features (i) on CLIP [Radford et al., 2021] in both zero-shot and linear-probe modes, and (ii) on CIFAR-10 [Krizhevsky, 2009] and ImageNet with ResNet18/50 checkpoints trained with supervised contrastive (SCL) loss [Khosla et al., 2020]. Next, we show that ORA's scale-invariant scores can be ensemble-summed across architectures for additional gains, and that it pairs seamlessly with post-hoc activation-shaping methods such as ReAct and ASH. Finally, we present an ablation study to assess the contributions of key design choices.
Researcher Affiliation	Academia	Berker Demirel, Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria, EMAIL; Marco Fumero, Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria, EMAIL; Francesco Locatello, Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria, EMAIL
Pseudocode	Yes	D Algorithm Box: We present the pseudocode for ORA in Algorithm Box 1. It depicts how ORA assigns a score given a sample x, a pretrained model f, and the ID statistics, i.e., the mean of the in-distribution features µID.
Open Source Code	Yes	Code is available at https://github.com/berkerdemirel/ORA-OOD-Detection-with-Relative-Angles.
Open Datasets	Yes	Benchmarks. We consider two widely used benchmarks: CIFAR-10 [Krizhevsky, 2009] and ImageNet [Deng et al., 2009]. We included the evaluation on the CIFAR-10 OOD benchmark to show the performance on smaller-scale datasets. In CIFAR-10 experiments, we use a pretrained ResNet-18 architecture [He et al., 2016] trained with supervised contrastive loss [Khosla et al., 2020], following previous literature [Liu and Qin, 2024, Sun et al., 2022, Sehwag et al., 2021]. During inference, 10,000 test samples are used to set the in-distribution scores and choose the threshold value λ, while the datasets SVHN [Netzer et al., 2011], iSUN [Xu et al., 2015], Places365 [Zhou et al., 2017] and Texture [Cimpoi et al., 2014] are used to obtain out-of-distribution scores and metric evaluation. For the large-scale ImageNet OOD benchmark, we extend prior evaluations [Liu and Qin, 2024, Sun et al., 2022, Park et al., 2023, Ren et al., 2021, Sun et al., 2021, Xu et al., 2024b] by going beyond ResNet-50 and including a diverse set of nine models, such as ConvNeXt [Liu et al., 2022], Swin Transformer [Liu et al., 2021], DeiT [Touvron et al., 2021], and EVA [Fang et al., 2023], along with their ImageNet-21k pretrained counterparts when available. This broader evaluation allows a more comprehensive assessment of OOD detection performance across modern architectures. A validation set of 50,000 ImageNet samples is used to set ID scores and the threshold, while the OOD datasets include iNaturalist [Van Horn et al., 2018], SUN [Xiao et al., 2010], Places365 [Zhou et al., 2017], and Texture [Cimpoi et al., 2014].
Dataset Splits	Yes	During inference, 10,000 test samples are used to set the in-distribution scores and choose the threshold value λ, while the datasets SVHN [Netzer et al., 2011], iSUN [Xu et al., 2015], Places365 [Zhou et al., 2017] and Texture [Cimpoi et al., 2014] are used to obtain out-of-distribution scores and metric evaluation. For the large-scale ImageNet OOD benchmark, [...] A validation set of 50,000 ImageNet samples is used to set ID scores and the threshold, while the OOD datasets include iNaturalist [Van Horn et al., 2018], SUN [Xiao et al., 2010], Places365 [Zhou et al., 2017], and Texture [Cimpoi et al., 2014].
Hardware Specification	Yes	All experiments are evaluated on a single Nvidia H100 GPU.
Software Dependencies	Yes	We used PyTorch [Paszke et al., 2019] to conduct our experiments. We obtain the checkpoints of pretrained models ResNet18 with supervised contrastive loss and ResNet50 with supervised contrastive loss from Liu and Qin [2024]'s work for a fair comparison. In the experiment where we aggregate different models' confidences, the ViT-B/16 [Dosovitskiy et al., 2020] checkpoint is retrieved from the publicly available repository https://github.com/lukemelas/PyTorch-Pretrained-ViT/tree/master. In the experiment where we merge ORA with the activation shaping algorithms ASH [Djurisic et al., 2023], Scale [Xu et al., 2024b] and ReAct [Sun et al., 2021], we set the percentile thresholds to 35, 90 and 80, respectively. For the extended results in Table 1, we used the timm [Wightman, 2020] checkpoints for the models ConvNeXt [Liu et al., 2022], Swin [Liu et al., 2021], DeiT [Touvron et al., 2021] and EVA [Fang et al., 2023]. Similarly, for the CLIP experiments in Table 2, we used the Hugging Face checkpoint of CLIP ViT-H/14 [LAION].
Experiment Setup	Yes	In the experiment where we merge ORA with the activation shaping algorithms ASH [Djurisic et al., 2023], Scale [Xu et al., 2024b] and ReAct [Sun et al., 2021], we set the percentile thresholds to 35, 90 and 80, respectively. For the extended results in Table 1, we used the timm [Wightman, 2020] checkpoints for the models ConvNeXt [Liu et al., 2022], Swin [Liu et al., 2021], DeiT [Touvron et al., 2021] and EVA [Fang et al., 2023]. Similarly, for the CLIP experiments in Table 2, we used the Hugging Face checkpoint of CLIP ViT-H/14 [LAION]. All experiments are evaluated on a single Nvidia H100 GPU. Note that, thanks to our hyperparameter-free post-hoc score function, all experiments are deterministic given the pretrained model.
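
The excerpts above say that held-out ID samples are used to set the in-distribution scores and choose the threshold value λ, and that OOD datasets are then scored against it. The following is a minimal sketch of how such a threshold is commonly chosen, not the authors' code. It assumes that higher scores mean "more in-distribution" and that λ is fixed at a 95% true-positive rate on ID data; neither convention is stated on this page. The function names and the synthetic Gaussian scores are hypothetical.

```python
import random


def threshold_at_tpr(id_scores, tpr=0.95):
    """Return a threshold lam such that at least `tpr` of ID scores are >= lam.

    Assumption of this sketch: higher score = more in-distribution.
    """
    ordered = sorted(id_scores)
    # Index of the (1 - tpr) quantile; every score at or above it counts as ID.
    k = int((1.0 - tpr) * len(ordered))
    return ordered[k]


def fpr_at_threshold(ood_scores, lam):
    """Fraction of OOD samples wrongly accepted as ID (score >= lam)."""
    return sum(s >= lam for s in ood_scores) / len(ood_scores)


# Synthetic stand-ins for detector scores (hypothetical data, fixed seed).
rng = random.Random(0)
id_scores = [rng.gauss(1.0, 0.5) for _ in range(10_000)]
ood_scores = [rng.gauss(-1.0, 0.5) for _ in range(2_000)]

lam = threshold_at_tpr(id_scores)        # threshold chosen on ID data only
fpr = fpr_at_threshold(ood_scores, lam)  # false-positive rate on OOD data
```

Because λ depends only on the ID scores, the same threshold can be reused across every OOD dataset, which is why a single held-out ID set suffices for each benchmark.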