RDumb: A simple approach that questions our progress in continual test-time adaptation
Authors: Ori Press, Steffen Schneider, Matthias Kümmerer, Matthias Bethge
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, RDumb, that periodically resets the model to its pretrained state. RDumb performs better or on par with the previously proposed state-of-the-art in all considered benchmarks. (A code sketch of this reset scheme follows the table.) |
| Researcher Affiliation | Academia | Ori Press1 Steffen Schneider1,2 Matthias Kümmerer1 Matthias Bethge1 1University of Tübingen, Tübingen AI Center, Germany 2EPFL, Geneva, Switzerland |
| Pseudocode | Yes | Algorithm 1 describes the pseudo code of the algorithm used to generate CCC. |
| Open Source Code | Yes | Code: https://github.com/oripress/CCC. |
| Open Datasets | Yes | ImageNet-C [12]: Creative Commons Attribution 4.0 International, https://zenodo.org/record/2235448; ImageNet-C [12], code for generating corruptions: Apache License 2.0, https://github.com/hendrycks/robustness; ImageNet-3D-CC [16]: CC-BY-NC 4.0 License, https://github.com/EPFL-VILAB/3DCommonCorruptions |
| Dataset Splits | Yes | We select a subset of 5,000 images from the ImageNet validation set. For each corruption (c1, s1, c2, s2), we corrupt all 5,000 images accordingly and evaluate the resulting images with a pre-trained ResNet-50 [10]. The resulting accuracy is what we refer to as baseline accuracy and what we use for controlling difficulty. (A sketch of this measurement follows the table.) |
| Hardware Specification | Yes | We conduct all experiments on Nvidia RTX 2080 TI GPUs with 12GB memory per device. |
| Software Dependencies | No | PyTorch's [31] backbones https://pytorch.org/vision/stable/models.html - This reference to PyTorch backbones does not specify the version of PyTorch or any other relevant software libraries used for implementation, which is necessary for reproducibility. |
| Experiment Setup | Yes | For all models, we use a batch size of 64. Following the original implementations, Tent, ETA, EATA, and RDumb use SGD with a learning rate of 2.5 × 10⁻⁴. RPL uses SGD with a learning rate of 5 × 10⁻⁴. SLR uses the Adam optimizer with a learning rate of 6 × 10⁻⁴. CoTTA uses SGD with a learning rate of 0.01, and CPL uses SGD with a learning rate of 0.001. We reset every T = 1,000 steps, as determined by a hyperparameter search on the holdout set (Section 6). |
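
The reset baseline described in the Research Type row, together with the hyperparameters quoted in the Experiment Setup row, can be summarized in code. The following is a minimal sketch under stated assumptions, not the authors' implementation (which is available at https://github.com/oripress/CCC): it assumes a Tent-style entropy-minimization objective that adapts only BatchNorm affine parameters, uses the quoted batch size of 64 and SGD learning rate of 2.5 × 10⁻⁴ (the momentum value is an assumption), and resets to the pretrained weights every T = 1,000 steps.

```python
import copy
import torch
from torchvision.models import resnet50, ResNet50_Weights

RESET_INTERVAL = 1_000  # T = 1,000 steps, from the hyperparameter search quoted above


def configure_model(model):
    """Adapt only BatchNorm affine parameters (Tent-style); freeze everything else."""
    model.train()  # BatchNorm uses batch statistics during adaptation
    params = []
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.requires_grad_(True)
            params += [m.weight, m.bias]
        else:
            for p in m.parameters(recurse=False):
                p.requires_grad_(False)
    return params


def rdumb(stream, device="cuda"):
    """Adapt on a stream of image batches, resetting to pretrained weights every T steps."""
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).to(device)
    pretrained_state = copy.deepcopy(model.state_dict())  # kept around for resets

    def fresh_optimizer():
        # lr 2.5e-4 is quoted above; the momentum value is an assumption
        return torch.optim.SGD(configure_model(model), lr=2.5e-4, momentum=0.9)

    optimizer = fresh_optimizer()
    for step, batch in enumerate(stream):
        if step > 0 and step % RESET_INTERVAL == 0:
            # RDumb: discard everything learned so far and restart from the pretrained model
            model.load_state_dict(pretrained_state)
            optimizer = fresh_optimizer()

        logits = model(batch.to(device))
        # Tent-style objective: minimize the mean prediction entropy of the batch
        loss = -(logits.softmax(1) * logits.log_softmax(1)).sum(1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        yield logits.argmax(1).detach()  # predictions from the forward pass used for adaptation
```

Resetting discards both the adapted weights and the optimizer state, returning the model to its pretrained starting point before drift can accumulate over a long corruption stream.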
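The Dataset Splits row describes how CCC controls difficulty: corrupt a fixed 5,000-image subset of the ImageNet validation set and use a pretrained ResNet-50's accuracy on it as the "baseline accuracy". A minimal sketch of that measurement follows; `apply_corruption` is a hypothetical stand-in for the ImageNet-C corruption functions (https://github.com/hendrycks/robustness), and the data-loading details are assumptions rather than the paper's exact code.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.models import resnet50, ResNet50_Weights


@torch.no_grad()
def baseline_accuracy(subset, corruption, severity, apply_corruption, device="cuda"):
    """Accuracy of a frozen pretrained ResNet-50 on the corrupted 5,000-image subset."""
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).to(device).eval()
    loader = DataLoader(subset, batch_size=64, shuffle=False)
    correct = total = 0
    for images, labels in loader:
        corrupted = apply_corruption(images, corruption, severity)  # hypothetical helper
        preds = model(corrupted.to(device)).argmax(1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total  # the "baseline accuracy" CCC uses to control difficulty
```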