Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

Authors: Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on scene-level corruption benchmarks (ImageNet-C, CIFAR-10/100-C), object-level style shifts (DomainNet-126, PACS), and semantic segmentation (Cityscapes→ACDC) covering recurring and continuously evolving domain shifts show that ReservoirTTA substantially improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods.
Researcher Affiliation	Academia	¹EPFL, ²CHUV, ³UBC, ⁴Vector Institute, ⁵Aarhus University; ¹,²{firstname.lastname}@epfl.ch, ³,EMAIL
Pseudocode	Yes	Model Prediction: Predictions are then obtained via the ensemble's parameters from all domain-specific models (see pseudocode in Appendix C).
Open Source Code	Yes	Our code is publicly available at https://github.com/LTS5/ReservoirTTA.
Open Datasets	Yes	Extensive experiments on scene-level corruption benchmarks (ImageNet-C, CIFAR-10/100-C), object-level style shifts (DomainNet-126, PACS), and semantic segmentation (Cityscapes→ACDC)... Note: CIFAR10-C, CIFAR100-C, and ImageNet-C are publicly available online (Apache-2.0 license). CCC is also provided by the RDumb paper [31] (MIT license). Both DomainNet-126 and PACS are publicly available.
Dataset Splits	Yes	Classification is tested under CCC [31], CSC, and CDC settings over 20 rounds (averaging error rates, %; a subset is shown for clarity). For segmentation, we follow the Cityscapes→ACDC protocol [42], where ACDC presents four weather conditions (Fog, Night, Rain, Snow) sequentially. We report the mean IoU (%) averaged over 10 repetitions.
Hardware Specification	Yes	All experiments were run on a single NVIDIA A100 Tensor Core GPU (80 GB VRAM) on our internal cluster.
Software Dependencies	No	All methods are re-implemented in PyTorch [29] within a unified TTA repository [24] for fair comparison, using pre-trained source models from RobustBench [11].
Experiment Setup	Yes	For CIFAR-10-C and CIFAR-100-C, TTA baselines (except SAR [28]) are optimized with the Adam optimizer [17] using a learning rate of 1 × 10⁻³, a universal β = (0.9, 0.999), and a batch size of 200, whereas SAR employs SGD [32]. For ImageNet-C, models are adapted with SGD at a batch size of 64 and a learning rate of 2.5 × 10⁻⁴ (adjusted to 1 × 10⁻⁴ for ViT-B-16 in the CCC setting). For ReservoirTTA, we configure the system with a maximum of K_max = 16 reservoirs, determine the threshold using 2000 source examples, and update centroids with AdamW [23] at a learning rate of 1 × 10⁻⁴. Table 10 summarizes these settings.
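
The settings quoted in the Experiment Setup row can be collected into a small lookup, shown here as a minimal sketch rather than the authors' actual configuration code. The class name `AdaptConfig`, the dictionary keys, and the helper `baseline_config` are all hypothetical; only the numeric values come from the quoted text. SAR's SGD settings on CIFAR are not given in the row, so they are left out.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class AdaptConfig:
    """Hypothetical container for one baseline's adaptation settings."""
    optimizer: str
    lr: float
    batch_size: int
    betas: Optional[Tuple[float, float]] = None  # only meaningful for Adam


# Values taken from the quoted Experiment Setup text; key names are assumptions.
BASELINE_CONFIGS = {
    "cifar-c": AdaptConfig("adam", 1e-3, 200, (0.9, 0.999)),  # all baselines except SAR
    "imagenet-c": AdaptConfig("sgd", 2.5e-4, 64),
}

# ReservoirTTA-specific settings quoted in the same row; key names are assumptions.
RESERVOIR_TTA = {
    "k_max": 16,                        # maximum number of reservoirs
    "threshold_source_examples": 2000,  # source examples used to set the threshold
    "centroid_optimizer": "adamw",
    "centroid_lr": 1e-4,
}


def baseline_config(dataset: str, arch: Optional[str] = None,
                    setting: Optional[str] = None) -> AdaptConfig:
    """Return the baseline settings, applying the ViT-B-16 / CCC learning-rate override."""
    cfg = BASELINE_CONFIGS[dataset]
    if dataset == "imagenet-c" and arch == "vit-b-16" and setting == "ccc":
        return replace(cfg, lr=1e-4)
    return cfg
```

For example, `baseline_config("imagenet-c", arch="vit-b-16", setting="ccc")` yields the adjusted learning rate, while any other ImageNet-C combination keeps the default.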