Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
RDumb: A simple approach that questions our progress in continual test-time adaptation
Authors: Ori Press, Steffen Schneider, Matthias Kümmerer, Matthias Bethge
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, RDumb, that periodically resets the model to its pretrained state. RDumb performs better or on par with the previously proposed state-of-the-art in all considered benchmarks. |
| Researcher Affiliation | Academia | Ori Press¹, Steffen Schneider¹˒², Matthias Kümmerer¹, Matthias Bethge¹; ¹University of Tübingen, Tübingen AI Center, Germany; ²EPFL, Geneva, Switzerland |
| Pseudocode | Yes | Algorithm 1 describes the pseudo code of the algorithm used to generate CCC. |
| Open Source Code | Yes | Code: https://github.com/oripress/CCC. |
| Open Datasets | Yes | ImageNet-C [12]: Creative Commons Attribution 4.0 International, https://zenodo.org/record/2235448 ImageNet-C [12], code for generating corruptions: Apache License 2.0 https://github.com/hendrycks/robustness ImageNet-3D-CC [16]: CC-BY-NC 4.0 License https://github.com/EPFL-VILAB/3DCommonCorruptions |
| Dataset Splits | Yes | We select a subset of 5,000 images from the ImageNet validation set. For each corruption (c1, s1, c2, s2), we corrupt all 5,000 images accordingly and evaluate the resulting images with a pre-trained ResNet-50 [10]. The resulting accuracy is what we refer to as baseline accuracy and what we use for controlling difficulty. |
| Hardware Specification | Yes | We conduct all experiments on Nvidia RTX 2080 TI GPUs with 12GB memory per device. |
| Software Dependencies | No | PyTorch's [31] Backbones: https://pytorch.org/vision/stable/models.html. This reference to PyTorch backbones does not specify the version of PyTorch or of any other software library used in the implementation, which is necessary for reproducibility. |
| Experiment Setup | Yes | For all models, we use a batch size of 64. Following the original implementations, Tent, ETA, EATA, and RDumb use SGD with a learning rate of 2.5 × 10⁻⁴. RPL uses SGD with a learning rate of 5 × 10⁻⁴. SLR uses the Adam optimizer with a learning rate of 6 × 10⁻⁴. CoTTA uses SGD with a learning rate of 0.01, and CPL uses SGD with a learning rate of 0.001. We reset every T = 1,000 steps, as determined by a hyperparameter search on the holdout set (Section 6). |
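
The periodic reset described in the Research Type and Experiment Setup rows can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation (see the linked repository for that). The class name `PeriodicResetAdapter`, the helper `toy_adapt_step`, and the dictionary-based "model state" are hypothetical stand-ins. The learning rates assume that the exponents in the quoted setup are negative powers of ten, and whether the optimizer state is also reset is not stated in the excerpts above.

```python
import copy

# Hyperparameters as quoted in the Experiment Setup row.
# Assumption: the learning-rate exponents are negative powers of ten.
BATCH_SIZE = 64
RESET_INTERVAL = 1000  # T, the number of adaptation steps between resets
OPTIMIZERS = {
    "Tent": ("SGD", 2.5e-4),
    "ETA": ("SGD", 2.5e-4),
    "EATA": ("SGD", 2.5e-4),
    "RDumb": ("SGD", 2.5e-4),
    "RPL": ("SGD", 5e-4),
    "SLR": ("Adam", 6e-4),
    "CoTTA": ("SGD", 0.01),
    "CPL": ("SGD", 0.001),
}


class PeriodicResetAdapter:
    """Hypothetical wrapper that restores the initial state every
    `reset_interval` adaptation steps."""

    def __init__(self, initial_state, adapt_step, reset_interval=RESET_INTERVAL):
        # Keep a private copy of the pretrained state so that later
        # updates can never modify it.
        self._initial_state = copy.deepcopy(initial_state)
        self.state = copy.deepcopy(initial_state)
        self.adapt_step = adapt_step
        self.reset_interval = reset_interval
        self.steps_since_reset = 0
        self.num_resets = 0

    def step(self, batch):
        # Once `reset_interval` updates have been applied, discard the
        # adapted state before processing the next batch.
        if self.steps_since_reset == self.reset_interval:
            self.state = copy.deepcopy(self._initial_state)
            self.steps_since_reset = 0
            self.num_resets += 1
        self.state = self.adapt_step(self.state, batch)
        self.steps_since_reset += 1
        return self.state


def toy_adapt_step(state, batch):
    # Stand-in for one gradient update: nudge a single weight toward
    # the batch mean. A real TTA method would update model parameters.
    batch_mean = sum(batch) / len(batch)
    new_state = dict(state)
    new_state["w"] = state["w"] + 0.1 * (batch_mean - state["w"])
    return new_state


# Usage: a short reset interval makes the behaviour easy to follow.
adapter = PeriodicResetAdapter({"w": 0.0}, toy_adapt_step, reset_interval=3)
for _ in range(7):
    adapter.step([1.0, 1.0])
print(adapter.num_resets, adapter.steps_since_reset)
```

In this sketch the reset is applied just before the first update of each new interval, so a run of 7 steps with an interval of 3 triggers two resets. Other placements of the reset, such as immediately after the T-th update, would be equally consistent with the quoted description.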