Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Authors: Konstantinos Dafnis, Dimitris Metaxas

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8× faster with a 12× smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
Researcher Affiliation	Academia	Konstantinos M. Dafnis Department of Computer Science Rutgers University Dimitris N. Metaxas Department of Computer Science Rutgers University
Pseudocode	No	The paper describes the methodology in prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code is available at https://github.com/kdafnis/STS.
Open Datasets	Yes	We conduct a comprehensive evaluation of our method across a diverse set of benchmark datasets, with a particular focus on out-of-domain generalization. To assess the model's ability to handle distribution shifts, we utilize several ImageNet variants, including ImageNet-A [16], ImageNet-V2 [31], ImageNet-R [14], and ImageNet-Sketch (also referred to as ImageNet-K) [37]. For Fine-grained Classification (also referred to as "Cross-Datasets Generalization" in previous works), in line with [32], we include Flowers102 [27], DTD [5], Pets [29], UCF [33], and Caltech101 [8]. Furthermore, to assess the model's adaptability across diverse domains, we incorporate Aircraft [25], EuroSAT [13], Cars [22], Food [3], and SUN397 [38].
Dataset Splits	Yes	For all datasets, we utilize the test splits defined by Zhou et al. [43], adhering to the common evaluation protocol. ... In Table D1, we present the detailed statistics of each dataset we used in our experiments, including the number of classes, the sizes of training, validation and testing sets, and their original tasks.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA RTX8000 GPU with 45GB of memory.
Software Dependencies	No	The paper mentions the AdamW [24] optimizer and CLIP [30] as models/frameworks but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup	Yes	The learnable vector is initialized to zero and optimized for a single step using the AdamW [24] optimizer with a learning rate of 5e-3 across all datasets. In our method, each class prototype is initialized using the hand-crafted prompt "a photo of a {CLASS}". ... To identify high-confidence samples, we select the 10% of batch samples with the lowest entropy and compute the marginal entropy based on their predicted probability distributions.
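
The confidence-selection step quoted under Experiment Setup can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that). The function names, the list-of-lists input format, and the reading of "marginal entropy" as the entropy of the averaged distribution over the selected samples are assumptions made here for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident(batch_probs, fraction=0.10):
    """Keep the `fraction` of distributions with the lowest entropy.

    `fraction=0.10` mirrors the 10% figure quoted above; at least one
    sample is always kept (an assumption of this sketch).
    """
    k = max(1, int(len(batch_probs) * fraction))
    return sorted(batch_probs, key=entropy)[:k]


def marginal_entropy(selected):
    """Entropy of the class-wise mean of the selected distributions."""
    n = len(selected)
    num_classes = len(selected[0])
    mean = [sum(p[c] for p in selected) / n for c in range(num_classes)]
    return entropy(mean)


# Hypothetical batch: one confident prediction among nine uniform ones.
batch = [[1.0, 0.0, 0.0]] + [[1 / 3, 1 / 3, 1 / 3] for _ in range(9)]
confident = select_confident(batch)
loss = marginal_entropy(confident)
```

In a test-time adaptation setting, a quantity like `loss` would typically serve as the objective minimized during the single optimization step; how the paper wires this into its update is not stated in the excerpt above.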