Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Statistics Caching Test-Time Adaptation for Vision-Language Models

Authors: Zenghao Guan, Yucan Zhou, Wu Liu, Xiaoyan Gu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate SCA on 15 diverse datasets across out-of-domain and cross-domain benchmarks, where it achieves notable performance gains over state-of-the-art prompt tuning and cache-based methods. Extensive experiments demonstrate that SCA achieves compelling performance while maintaining competitive computational efficiency. The code is available at this link. Section 4: Experiments. Table 1: Experimental results on the cross-domain benchmark with two backbones of CLIP. Table 2: Experimental results on the OOD benchmark with two backbones of CLIP. Section 4.3: Ablation Studies.
Researcher Affiliation	Academia	Zenghao Guan1,2,3, Yucan Zhou4, Wu Liu5, Xiaoyan Gu1,2,3 1Institute of Information Engineering, Chinese Academy of Sciences, 2School of Cyber Security, University of Chinese Academy of Sciences, 3State Key Laboratory of Cyberspace Security Defense, 4Tianjin University, 5University of Science and Technology of China EMAIL, EMAIL
Pseudocode	No	The paper describes the methodology using mathematical equations and textual explanations, but it does not contain a clearly labeled pseudocode block or algorithm steps formatted like code.
Open Source Code	No	The code is available at this link. (In the NeurIPS Paper Checklist, Question 5: Answer: [No] Justification: Code will be released after being accepted.)
Open Datasets	Yes	Datasets. Following prior work [12, 13, 15, 27], we evaluate our method using two established benchmarks: the cross-domain benchmark and the out-of-distribution (OOD) benchmark. (1) The cross-domain benchmark includes 10 diverse image classification datasets from distinct domains: FGVCAircraft [28], Caltech101 [29], Stanford Cars [30], DTD [31], EuroSAT [32], Flowers102 [22], Food101 [33], Oxford Pets [34], SUN397 [35], and UCF101 [36]. (2) The OOD benchmark evaluates performance on ImageNet [37] and four challenging variants: ImageNet-A [9], ImageNet-V2 [10], ImageNet-R [38], and ImageNet-Sketch [39].
Dataset Splits	Yes	Table B1: Summary of the 15 image classification datasets used in experiments. The last four ImageNet variant datasets are designed for evaluation only and contain no training or validation splits. Each entry reads Dataset: Classes | Splits (train / val / test where all three are given) | Task. Cross-Domain benchmark: Caltech101 [29]: 100 | 4,128 / 1,649 / 2,465 | Object recognition; DTD [31]: 47 | 2,820 / 1,128 / 1,692 | Texture recognition; EuroSAT [32]: 10 | 13,500 / 5,400 / 8,100 | Satellite imagery; FGVCAircraft [28]: 100 | 3,334 / 3,333 / 3,333 | Fine-grained aircraft recognition; Flowers102 [22]: 102 | 4,093 / 1,633 / 2,463 | Fine-grained flower recognition; Food101 [33]: 101 | 50,500 / 20,200 / 30,300 | Fine-grained food recognition; Oxford Pets [34]: 37 | 2,944 / 736 / 3,669 | Fine-grained pet recognition; Stanford Cars [30]: 196 | 6,509 / 1,635 / 8,041 | Fine-grained car recognition; SUN397 [35]: 397 | 15,880 / 3,970 / 19,850 | Scene recognition; UCF101 [36]: 101 | 7,639 / 1,898 / 3,783 | Action recognition. Out-of-Distribution benchmark: ImageNet [37]: 1,000 | 1.28M, 50,000 | Object recognition; ImageNet-V2 [10]: 1,000 | 10,000 | Robustness (collocation shift); ImageNet-Sketch [39]: 1,000 | 50,889 | Robustness (sketch domain); ImageNet-A [9]: 200 | 7,500 | Robustness (adversarial attack); ImageNet-R [38]: 200 | 30,000 | Robustness (multi-domain).
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA A100 RTX GPU, using top-1 accuracy to measure classification performance.
Software Dependencies	No	The paper mentions using CLIP with ResNet-50 and ViT-B/16 as visual encoders, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup	Yes	Implementation Details. Consistent with prior work [15, 25], our experiments adopt ResNet-50 [40] and ViT-B/16 [41] as the visual encoders for CLIP, with a batch size of 1 to satisfy online processing requirements. For textual prompts, we employ the prompt ensembling strategy as used in previous works [15, 42, 20, 25]. For data augmentation, we adopt the approach from DPE [12], generating 63 randomly resized crops per test image for the OOD benchmark. No data augmentation is applied in the cross-domain benchmark. By default, we set the ridge coefficient γ to 1e4, threshold τ to 0.1, sharpness coefficient β to 0.5. All experiments are conducted on a single NVIDIA A100 RTX GPU, using top-1 accuracy to measure classification performance.
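
For readers who want to record the reported settings in machine-readable form, the values quoted in the Experiment Setup row could be captured roughly as follows. This is a minimal sketch, not the authors' code: the class name, field names, benchmark keys, and helper functions are all hypothetical, and only the numeric values and backbone names come from the quoted text. The accuracy helper illustrates the standard definition of top-1 accuracy, which the paper names as its metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    backbones: tuple = ("ResNet-50", "ViT-B/16")  # CLIP visual encoders
    batch_size: int = 1                # one test image at a time (online setting)
    ridge_gamma: float = 1e4           # ridge coefficient (gamma)
    threshold_tau: float = 0.1         # threshold (tau)
    sharpness_beta: float = 0.5        # sharpness coefficient (beta)
    ood_crops_per_image: int = 63      # randomly resized crops, OOD benchmark only


def crops_for(setup: ReportedSetup, benchmark: str) -> int:
    """Return the number of augmented crops per test image for a benchmark.

    The keys "ood" and "cross_domain" are assumed labels for the two benchmarks.
    """
    if benchmark == "ood":
        return setup.ood_crops_per_image
    if benchmark == "cross_domain":
        return 0  # no data augmentation is applied on the cross-domain benchmark
    raise ValueError(f"unknown benchmark: {benchmark!r}")


def top1_accuracy(predictions, labels) -> float:
    """Percentage of samples whose predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```

A call such as crops_for(ReportedSetup(), "ood") returns the crop count quoted for the OOD benchmark. Software versions (Python, PyTorch, CUDA) are deliberately absent from the sketch because the paper does not report them.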