RDumb: A simple approach that questions our progress in continual test-time adaptation

Authors: Ori Press, Steffen Schneider, Matthias Kümmerer, Matthias Bethge

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one state-of-the-art method collapses and performs worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, RDumb, that periodically resets the model to its pretrained state. RDumb performs better than or on par with the previously proposed state of the art on all considered benchmarks."
Researcher Affiliation | Academia | Ori Press 1, Steffen Schneider 1,2, Matthias Kümmerer 1, Matthias Bethge 1; 1 University of Tübingen, Tübingen AI Center, Germany; 2 EPFL, Geneva, Switzerland
Pseudocode | Yes | "Algorithm 1 describes the pseudocode of the algorithm used to generate CCC."
Open Source Code | Yes | Code: https://github.com/oripress/CCC
Open Datasets | Yes | ImageNet-C [12]: Creative Commons Attribution 4.0 International, https://zenodo.org/record/2235448; ImageNet-C [12], code for generating corruptions: Apache License 2.0, https://github.com/hendrycks/robustness; ImageNet-3DCC [16]: CC-BY-NC 4.0 License, https://github.com/EPFL-VILAB/3DCommonCorruptions
Dataset Splits | Yes | "We select a subset of 5,000 images from the ImageNet validation set. For each corruption (c1, s1, c2, s2), we corrupt all 5,000 images accordingly and evaluate the resulting images with a pre-trained ResNet-50 [10]. The resulting accuracy is what we refer to as baseline accuracy and what we use for controlling difficulty."
Hardware Specification | Yes | "We conduct all experiments on Nvidia RTX 2080 Ti GPUs with 12GB memory per device."
Software Dependencies | No | "PyTorch [31] backbones, https://pytorch.org/vision/stable/models.html". This reference to PyTorch backbones does not specify the PyTorch version or the versions of any other software libraries used for implementation, which are necessary for reproducibility.
Experiment Setup | Yes | "For all models, we use a batch size of 64. Following the original implementations, Tent, ETA, EATA, and RDumb use SGD with a learning rate of 2.5 × 10^-4. RPL uses SGD with a learning rate of 5 × 10^-4. SLR uses the Adam optimizer with a learning rate of 6 × 10^-4. CoTTA uses SGD with a learning rate of 0.01, and CPL uses SGD with a learning rate of 0.001. We reset every T = 1,000 steps, as determined by a hyperparameter search on the holdout set (Section 6)."
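The RDumb baseline described in the abstract (adapt continually, but reset the model to its pretrained state every T steps) can be sketched as follows. This is a minimal, framework-free illustration: the adaptation step (e.g. a Tent-style entropy-minimization update) is abstracted into a callable, and a plain dict of parameters stands in for a real network's state_dict.

```python
import copy


class RDumb:
    """Schematic of the RDumb baseline: run an off-the-shelf TTA update on
    every batch, and restore the pretrained parameters every `reset_every`
    steps (the paper uses T = 1,000)."""

    def __init__(self, pretrained_params, adapt_fn, reset_every=1000):
        self.pretrained = copy.deepcopy(pretrained_params)  # frozen snapshot
        self.params = copy.deepcopy(pretrained_params)      # adapted copy
        self.adapt_fn = adapt_fn                            # one TTA update step
        self.reset_every = reset_every
        self.step = 0

    def update(self, batch):
        # Adapt on the incoming batch, then check whether it is time to reset.
        self.adapt_fn(self.params, batch)
        self.step += 1
        if self.step % self.reset_every == 0:
            # The "dumb" part: discard everything learned since the last reset.
            self.params = copy.deepcopy(self.pretrained)
```

The point of the reset is to bound error accumulation: whatever drift the adaptation rule builds up over a corruption sequence is wiped out every T steps, which is why RDumb does not collapse asymptotically.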
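The difficulty-control procedure quoted under "Dataset Splits" amounts to measuring a frozen classifier's accuracy on a corrupted copy of the 5,000-image holdout. A minimal sketch, where `corrupt` stands in for an ImageNet-C corruption at a given severity and `classify` for a frozen pretrained ResNet-50:

```python
def baseline_accuracy(images, labels, corrupt, classify):
    """Corrupt every holdout image and report the frozen classifier's
    accuracy. CCC uses this number to hold difficulty roughly constant
    while walking between corruption types and severities."""
    correct = sum(int(classify(corrupt(x)) == y) for x, y in zip(images, labels))
    return correct / len(images)
```

Because the classifier never adapts here, the resulting number depends only on the corruption pair and severities (c1, s1, c2, s2), which is what makes it usable as a difficulty knob.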
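The hyperparameters listed under "Experiment Setup" can be collected into a single configuration table. The optimizer names map onto the corresponding `torch.optim` classes; `reset_every` is RDumb's reset period T. The dict structure itself is an illustrative convention, not the paper's code.

```python
# Per-method optimizer settings as reported in the experiment setup.
TTA_CONFIG = {
    "Tent":  {"optimizer": "SGD",  "lr": 2.5e-4},
    "ETA":   {"optimizer": "SGD",  "lr": 2.5e-4},
    "EATA":  {"optimizer": "SGD",  "lr": 2.5e-4},
    "RDumb": {"optimizer": "SGD",  "lr": 2.5e-4, "reset_every": 1000},
    "RPL":   {"optimizer": "SGD",  "lr": 5e-4},
    "SLR":   {"optimizer": "Adam", "lr": 6e-4},
    "CoTTA": {"optimizer": "SGD",  "lr": 1e-2},
    "CPL":   {"optimizer": "SGD",  "lr": 1e-3},
}

BATCH_SIZE = 64  # shared across all methods
```

Keeping the settings in one table makes it easy to verify that RDumb reuses the Tent/ETA/EATA optimizer unchanged: the only addition is the reset period.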