Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models

Authors: Shenglong Zhou, Manjiang Yin, Leiyu Sun, Shicai Yang, Di Xie, Jiang Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on out-of-distribution and cross-domain benchmark datasets demonstrate that our proposed SSG consistently outperforms previous state-of-the-art methods while also exhibiting promising computational efficiency.
Researcher Affiliation	Collaboration	(1) Hikvision Research Institute; (2) University of Science and Technology of China
Pseudocode	No	The paper describes the method and its components (PPD, PPDsh, PPDst) in text and uses figures to illustrate concepts, but does not include a formal pseudocode or algorithm block.
Open Source Code	No	We use the public dataset as mentioned in Section 4, and we will release the codes after the final decision.
Open Datasets	Yes	Out-of-distribution benchmark aims to evaluate the model's robustness to natural distribution shifts on 4 ImageNet [33] variants, including ImageNet-A [34], ImageNet-V2 [35], ImageNet-R [36], and ImageNet-Sketch [37]. The cross-domain benchmark aims to evaluate the transferring performance on 10 diverse recognition datasets, including FGVCAircraft [38], Caltech101 [39], Stanford Cars [40], DTD [41], EuroSAT [42], Flowers102 [43], Food101 [44], Oxford Pets [45], SUN397 [46], and UCF101 [47].
Dataset Splits	Yes	We follow the split in CoOp [16], and more details are shown in Appendix.
Hardware Specification	Yes	All experiments are conducted on a single 24GB NVIDIA RTX 4090 GPU.
Software Dependencies	No	The paper does not provide specific software version numbers. It mentions the general use of CLIP backbones, but no explicit versions for libraries like PyTorch, TensorFlow, or Python.
Experiment Setup	Yes	Following TPT [7] and TDA [9], we set the batch size as 1 and generate 63 augmented views for each test image, while setting the k as the top-10%. All experiments are conducted on a single 24GB NVIDIA RTX 4090 GPU. ...the utilised cache in SSG is a dynamic key-value cache, whose memory size is 3 for all datasets.
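
The Experiment Setup row can be read as a small configuration plus a view-selection step. The sketch below is illustrative only and is not the authors' code (the table notes the code has not been released). It assumes that "top-10%" means keeping the most confident views, ranked by lowest prediction entropy as in TPT, and that the original image is pooled with its 63 augmented views for 64 in total. The names `select_confident_views`, `TOP_FRACTION`, and `CACHE_SIZE` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
BATCH_SIZE = 1
NUM_AUGMENTED_VIEWS = 63
TOP_FRACTION = 0.10   # "top-10%"
CACHE_SIZE = 3        # memory size of the dynamic key-value cache


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident_views(view_logits, fraction=TOP_FRACTION):
    """Return indices of the lowest-entropy views (assumed TPT-style rule).

    view_logits: one list of class logits per view.
    At least one view is always kept.
    """
    entropies = [entropy(softmax(logits)) for logits in view_logits]
    keep = max(1, int(len(view_logits) * fraction))
    ranked = sorted(range(len(view_logits)), key=lambda i: entropies[i])
    return ranked[:keep]
```

Under these assumptions, 64 views with a 10% fraction keep `int(6.4) = 6` views per test image. How the selected views feed the shape and style guidance or the cache is not described in this table, so the sketch stops at selection.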