Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

X-Mahalanobis: Transformer Feature Mixing for Reliable OOD Detection

Authors: Tong Wei, Bolin Wang, Jiang-Xin Shi, Yu-Feng Li, Min-Ling Zhang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive empirical analyses to validate the superiority of our proposed method under zero-shot and fine-tuning settings using both class-balanced and long-tailed datasets. [...] 4 Experiments: We extensively evaluate X-Maha across different datasets and pre-trained models.
Researcher Affiliation	Academia	(1) School of Computer Science and Engineering, Southeast University, Nanjing, China; (2) Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China; (3) National Key Laboratory for Novel Software Technology, Nanjing University, China; (4) School of Artificial Intelligence, Nanjing University, China
Pseudocode	No	The paper describes the method and calculations in text and mathematical formulas such as Eq. (1) and (5), but does not contain an explicitly labeled pseudocode or algorithm block.
Open Source Code	Yes	The source code is available at https://github.com/SEUML/X-Maha.
Open Datasets	Yes	In this section, we compare our approach with the latest algorithms across both small- and large-scale OOD detection benchmarks. In line with prior research, we utilize CIFAR-100 and ImageNet as the in-distribution (ID) datasets. Additionally, we incorporate the more challenging long-tailed variants, CIFAR-100-LT and ImageNet-LT, as ID training sets to further demonstrate the effectiveness of our proposed method in OOD detection scenarios in the appendix. The imbalance ratio for CIFAR-100-LT is set to 100, reflecting a highly imbalanced class distribution. OOD datasets. When CIFAR-100 or CIFAR-100-LT is used as the ID dataset, we evaluate OOD detection performance on a range of diverse datasets, including Textures [6], SVHN [57], CIFAR-10, Tiny ImageNet [27], LSUN [56], and Places365 [60]. For experiments with ImageNet and ImageNet-LT as the ID datasets, our primary evaluation employs five established OOD datasets: Textures [6], Places365 [60], iNaturalist [47], ImageNet-O [15], and SUN [55]. Extended analysis using OpenOOD v1.5 [59] is presented in the Appendix.
Dataset Splits	No	In line with prior research, we utilize CIFAR-100 and ImageNet as the in-distribution (ID) datasets. [...] We fine-tune the pre-trained models using in-distribution data for downstream tasks. While standard splits are implicitly used for these common datasets, the paper does not explicitly state the split percentages, sample counts, or specific predefined split files, making it hard to reproduce the exact data partitioning without external knowledge.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA RTX 3090 GPU.
Software Dependencies	No	We implement our approach and all competing methods in the same framework on top of the ImageNet-21k pre-trained Vision Transformer (ViT) [8] and the official pretrained CLIP model. The paper mentions models and frameworks but does not provide specific version numbers for software dependencies like PyTorch, TensorFlow, CUDA, etc.
Experiment Setup	Yes	We employ a batch size of 64 for all experiments. For CIFAR-100 and CIFAR-100-LT, we set the initial learning rate to 0.01 with a cosine annealing scheduler and fine-tune for 10 epochs. For ImageNet and ImageNet-LT, the initial learning rate is set to 0.1, with a cosine annealing scheduler, and the models are fine-tuned for 5 and 20 epochs, respectively. We set λ = 1 on ImageNet and λ = 0.1 on CIFAR-100 for the CLIP model to calculate the scoring function. For the Adaptformer module, we set the dimension to C/(2L), where C is the number of classes, and L is the number of blocks in the ViT model. Other hyperparameters include a momentum of 0.9, and a weight decay of 5 × 10^-4, following LIFT [44].
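
The fine-tuning hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The Python below is illustrative only and is not taken from the authors' repository: the names FineTuneConfig, CONFIGS, cosine_lr, and lr_schedule are hypothetical, and the schedule assumes the learning rate is annealed once per epoch down to a minimum of 0, neither of which the quoted text specifies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    dataset: str
    initial_lr: float
    epochs: int
    batch_size: int = 64          # "batch size of 64 for all experiments"
    momentum: float = 0.9         # "momentum of 0.9"
    weight_decay: float = 5e-4    # "weight decay of 5 x 10^-4"


# One entry per in-distribution dataset named in the row.
CONFIGS = {
    "CIFAR-100": FineTuneConfig("CIFAR-100", initial_lr=0.01, epochs=10),
    "CIFAR-100-LT": FineTuneConfig("CIFAR-100-LT", initial_lr=0.01, epochs=10),
    "ImageNet": FineTuneConfig("ImageNet", initial_lr=0.1, epochs=5),
    "ImageNet-LT": FineTuneConfig("ImageNet-LT", initial_lr=0.1, epochs=20),
}


def cosine_lr(cfg: FineTuneConfig, epoch: int, min_lr: float = 0.0) -> float:
    """Standard cosine annealing from cfg.initial_lr down to min_lr.

    Assumption: per-epoch stepping and min_lr = 0; the source only says
    "cosine annealing scheduler".
    """
    progress = epoch / cfg.epochs
    return min_lr + 0.5 * (cfg.initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lr_schedule(cfg: FineTuneConfig) -> list[float]:
    """Learning rate used at the start of each fine-tuning epoch."""
    return [cosine_lr(cfg, epoch) for epoch in range(cfg.epochs)]


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        rates = ", ".join(f"{lr:.5f}" for lr in lr_schedule(cfg))
        print(f"{name}: {rates}")
```

Running the script prints one learning-rate sequence per dataset. If the authors step the scheduler per iteration rather than per epoch, or use a non-zero floor, the curve keeps the same shape but the per-epoch values differ.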