On Pitfalls of Test-Time Adaptation

Authors: Hao Zhao, Yuejiang Liu, Alexandre Alahi, Tao Lin

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, our benchmark reveals three common pitfalls in prior efforts.
Researcher Affiliation | Academia | Hao Zhao 1*, Yuejiang Liu 1*, Alexandre Alahi 1, Tao Lin 2,3 (*equal contribution); 1 École Polytechnique Fédérale de Lausanne (EPFL), 2 Research Center for Industries of the Future, Westlake University, 3 School of Engineering, Westlake University.
Pseudocode | Yes | Algorithm 1: Oracle model selection for online TTA
Open Source Code | Yes | Our code is available at https://github.com/lins-lab/ttab.
Open Datasets | Yes | To streamline standardized evaluations of TTA methods, we first equip the benchmark library with shared data loaders for a set of common datasets, including CIFAR10-C (Hendrycks & Dietterich, 2019), CIFAR10.1 (Recht et al., 2018), ImageNet-C (Hendrycks & Dietterich, 2019), OfficeHome (Venkateswara et al., 2017), PACS (Li et al., 2017), ColoredMNIST (Arjovsky et al., 2019), and Waterbirds (Sagawa et al., 2019). (A generic CIFAR10-C loading sketch appears after this table.)
Dataset Splits | Yes | Training-domain validation data is used to determine the number of supports to store in T3A, following Iwasawa & Matsuo (2021).
Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We use ResNet-18/ResNet-26/ResNet-50 as the base model on ColoredMNIST/CIFAR10-C/large-scale image datasets and always choose SGDm as the optimizer. We choose method-specific hyperparameters following prior work. Following Iwasawa & Matsuo (2021), we assign the pseudo-label in SHOT only if the prediction exceeds a threshold, set to 0.9 in our experiments, and use β = 0.3 for all experiments except β = 0.1 for ColoredMNIST, as in Liang et al. (2020). We set the number of augmentations B = 32 for small-scale images (e.g., CIFAR10-C, CIFAR100-C) and B = 64 for large-scale image sets like ImageNet-C, because this is the default option in Sun et al. (2020) and Zhang et al. We simply set N = 0, which controls the trade-off between source and estimated target statistics, because it achieves performance comparable to the best performance obtained with a batch size of 64 according to Schneider et al. (2020).
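
The "Experiment Setup" row above is dense, so the sketch below restates the reported choices as a plain Python dictionary. The keys and structure are hypothetical summaries for readability and are not the configuration schema of the ttab repository.

```python
# Illustrative restatement of the reported setup; all names/keys are hypothetical.
EXPERIMENT_SETUP = {
    "base_model": {
        "ColoredMNIST": "resnet18",
        "CIFAR10-C": "resnet26",
        "large_scale_datasets": "resnet50",   # e.g., ImageNet-C, OfficeHome, PACS
    },
    "optimizer": "SGD with momentum (SGDm)",
    "shot": {
        "pseudo_label_threshold": 0.9,        # assign pseudo-labels only above this confidence
        "beta": {"default": 0.3, "ColoredMNIST": 0.1},
    },
    "num_augmentations_B": {"small_scale": 32, "large_scale": 64},
    "bn_stat_prior_N": 0,                     # trade-off between source and estimated target statistics
}
```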
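
For the "Open Datasets" row, corrupted test sets such as CIFAR10-C are distributed as NumPy arrays; the sketch below shows one generic way to load a single corruption at a fixed severity. It assumes the file layout of the Hendrycks & Dietterich (2019) release and is not the shared data loader shipped in the ttab repository.

```python
# Minimal CIFAR-10-C loading sketch (illustrative; not the ttab loader).
# Assumes the released .npy files are present under `root`.
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def load_cifar10c(root: str, corruption: str = "gaussian_noise",
                  severity: int = 5, batch_size: int = 64) -> DataLoader:
    images = np.load(f"{root}/{corruption}.npy")   # (50000, 32, 32, 3): 10,000 images per severity 1-5
    labels = np.load(f"{root}/labels.npy")         # (50000,)
    lo, hi = (severity - 1) * 10_000, severity * 10_000
    x = torch.from_numpy(images[lo:hi]).permute(0, 3, 1, 2).float() / 255.0
    y = torch.from_numpy(labels[lo:hi]).long()
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=False)
```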