Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
DOTA: Distributional Test-time Adaptation of Vision-Language Models
Authors: Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate that Dota significantly mitigates forgetting and achieves state-of-the-art performance compared to existing methods. Extensive experiments on diverse datasets validate the effectiveness of the proposed method, demonstrating a significant improvement. |
| Researcher Affiliation | Academia | (1) School of Information and Communication Engineering, Beijing University of Posts and Telecommunications; (2) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; (3) College of Intelligence and Computing, Tianjin University; (4) School of Computer Science and Technology, Harbin Institute of Technology Shenzhen; (5) Institute for Infocomm Research (I2R), A*STAR Research Entities (ARES); (6) Show Lab, National University of Singapore |
| Pseudocode | Yes | Algorithm 1: The pseudocode of Dota. |
| Open Source Code | Yes | Code is available at https://github.com/skylineeeeen/DOTA. |
| Open Datasets | Yes | Aircraft [33], Caltech101 [12], Cars [29], DTD [7], EuroSAT [20], Flower102 [36], Food101 [4], Pets [37], SUN397 [49], and UCF101 [43]. ... ImageNet [10], ImageNet-A [22], ImageNet-R [21], ImageNet-S [46] and ImageNet-V2 [40]... Kather Colon dataset [26], PanNuke dataset [15], and the WSSS4LUAD dataset [18] |
| Dataset Splits | Yes | Table 15: Datasets details. Dataset ... Validation Size Test Size ... ImageNet N/A 50,000 ... Caltech101 1,649 2,465 ... |
| Hardware Specification | Yes | All experiments are conducted using a single NVIDIA RTX 4090 GPU and a 12-core Intel Xeon Platinum 8352V CPU. |
| Software Dependencies | No | The paper mentions building upon the pre-trained CLIP model but does not specify versions for any ancillary software libraries like PyTorch or Python. |
| Experiment Setup | Yes | Test-time adaptation is set for single-image scenarios, using a batch size of 1. ... We adjust σ² within [0.001, 0.002, 0.004], then search for the best η across [0.2, 0.3, 0.4, 0.5] and ρ across [0.005, 0.01, 0.02, 0.03], with the shrinkage parameter ϵ set to 0.0001. ... Another hyperparameter ω we consistently set to 0.001. |
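
The hyperparameter search quoted in the Experiment Setup row can be sketched as a two-stage procedure. This is only one possible reading of the quoted text, not the authors' code: the grids and the fixed values of ϵ and ω come from the row above, while the `staged_search` function, the `evaluate` callback, and the choice to hold η and ρ at their first grid values while tuning σ² are all hypothetical assumptions made for illustration.

```python
from itertools import product

# Grids and fixed values as quoted in the Experiment Setup row.
SIGMA2_GRID = [0.001, 0.002, 0.004]
ETA_GRID = [0.2, 0.3, 0.4, 0.5]
RHO_GRID = [0.005, 0.01, 0.02, 0.03]
EPSILON = 1e-4  # shrinkage parameter
OMEGA = 1e-3    # held constant in all runs


def staged_search(evaluate, eta0=ETA_GRID[0], rho0=RHO_GRID[0]):
    """Two-stage search (an assumed reading of "adjust ..., then search").

    `evaluate(sigma2, eta, rho)` is a hypothetical callback that returns a
    score to maximise, for example validation accuracy.
    """
    # Stage 1: choose sigma^2 with eta and rho held at assumed defaults.
    best_sigma2 = max(SIGMA2_GRID, key=lambda s: evaluate(s, eta0, rho0))
    # Stage 2: with sigma^2 fixed, search eta and rho jointly.
    best_eta, best_rho = max(
        product(ETA_GRID, RHO_GRID),
        key=lambda pair: evaluate(best_sigma2, pair[0], pair[1]),
    )
    return {
        "sigma2": best_sigma2,
        "eta": best_eta,
        "rho": best_rho,
        "epsilon": EPSILON,
        "omega": OMEGA,
    }
```

Under this reading the search costs 3 + 16 = 19 evaluations per dataset, compared with 48 for a full joint grid over all three parameters.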