Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Test-Time Adaptive Object Detection with Foundation Model

Authors: Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
Researcher Affiliation	Academia	Yingjie Gao (1,2), Yanan Zhang (3), Zhi Cai (1,2), Di Huang (1,2). (1) State Key Laboratory of Complex and Critical Software Environment, Beihang University; (2) School of Computer Science and Engineering, Beihang University; (3) School of Computer Science and Information Engineering, Hefei University of Technology.
Pseudocode	No	The paper describes the methodology using textual descriptions and diagrams (e.g., Figures 2 and 3), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/gaoyingjay/ttaod_foundation.
Open Datasets	Yes	We evaluate the effectiveness of our method across a variety of TTAOD scenarios, covering two benchmarks: the cross-corruption benchmark and the cross-dataset benchmark. The cross-corruption benchmark is widely adopted in previous TTAOD works [2, 31, 11] to assess model robustness, specifically including two datasets: Pascal-C and COCO-C. Pascal-C is constructed from the test set of Pascal VOC [5] by applying an image corruption package [20]... COCO-C is generated from COCO [16]... We adopt the ODinW-13 datasets as a novel cross-dataset benchmark to evaluate the detector's performance across 13 diverse object detection datasets, each representing a distinct domain with different categories.
Dataset Splits	Yes	Pascal-C is constructed from the test set of Pascal VOC [5] by applying an image corruption package [20], which consists of 15 types of corruptions. Each corrupted test set contains 4956 images spanning 20 classes. COCO-C is generated from COCO [16], which contains 80 object categories. Following the same procedure as Pascal-C, we construct COCO-C using the COCO val2017 set, which includes 5k images, to serve as the target domains. We perform test-time adaptation on the test sets of 13 sub-datasets, providing a comprehensive evaluation of the model's adaptability across varying class spaces.
Hardware Specification	Yes	All experiments are conducted on a single RTX 3090 GPU.
Software Dependencies	No	The paper mentions using "Grounding DINO with Swin-Tiny [18] as the visual backbone" and "DINOv2 with ViT-L [4] as the feature extractor". While these are specific models/frameworks, the paper does not specify general software dependencies such as Python, PyTorch, or CUDA with their version numbers, which are required for a reproducible description of ancillary software.
Experiment Setup	Yes	For the cross-corruption benchmark, we set the learning rate of the AdamW optimizer to 0.02 for text prompts and 0.2 for visual prompts, while freezing all other parameters pre-trained on large-scale data. The batch size is set to 4. For other hyperparameters, we set th_pl to 0.3, th_me to 0.3, and th_IoU to 0.2. The momentum coefficient γ in Eq. 6 is set to 0.999. The number m of visual prompts is set to 10, and the maximum capacity |Q_c|_max of IDM is set to 20. We set α = 5.0 and β = 5.0 for Pascal-C, while α = 1.0 and β = 5.0 for COCO-C. ... For the cross-dataset benchmark, we set th_pl and th_me to 0.3. The number m of visual prompts is set to 5, and the maximum capacity |Q_c|_max of IDM is set to 3. We use α = 1.0 and β = 5.0 across all datasets.
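
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the reported runs. The sketch below is illustrative only: the class name, field names, and dictionary keys are hypothetical and are not taken from the authors' repository. Values that the quoted excerpt does not state for a benchmark (for example, the learning rates for the cross-dataset benchmark) are left as None rather than guessed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TTAConfig:
    """Hyperparameters as quoted in the Experiment Setup row.

    All field names are hypothetical; None means "not stated in the excerpt".
    """
    lr_text_prompt: Optional[float]    # AdamW learning rate for text prompts
    lr_visual_prompt: Optional[float]  # AdamW learning rate for visual prompts
    batch_size: Optional[int]
    th_pl: float                       # threshold th_pl
    th_me: float                       # threshold th_me
    th_iou: Optional[float]            # threshold th_IoU
    momentum_gamma: Optional[float]    # momentum coefficient gamma (Eq. 6)
    num_visual_prompts: int            # m
    idm_max_capacity: int              # |Q_c|_max of IDM
    alpha: float
    beta: float


# Cross-corruption benchmark, Pascal-C settings.
PASCAL_C = TTAConfig(
    lr_text_prompt=0.02, lr_visual_prompt=0.2, batch_size=4,
    th_pl=0.3, th_me=0.3, th_iou=0.2, momentum_gamma=0.999,
    num_visual_prompts=10, idm_max_capacity=20, alpha=5.0, beta=5.0,
)

# COCO-C shares the cross-corruption settings except for alpha.
COCO_C = replace(PASCAL_C, alpha=1.0)

# Cross-dataset benchmark (ODinW-13); unstated values stay None.
ODINW_13 = TTAConfig(
    lr_text_prompt=None, lr_visual_prompt=None, batch_size=None,
    th_pl=0.3, th_me=0.3, th_iou=None, momentum_gamma=None,
    num_visual_prompts=5, idm_max_capacity=3, alpha=1.0, beta=5.0,
)

CONFIGS = {"pascal_c": PASCAL_C, "coco_c": COCO_C, "odinw_13": ODINW_13}
```

Keeping unreported values as None makes the gaps in the quoted setup explicit, so anyone filling them in from the paper or the released code does so deliberately.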