Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line

Authors: Eungyeup Kim, Mingjie Sun, Christina Baek, Aditi Raghunathan, J. Zico Kolter

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Recently, Miller et al. [32] and Baek et al. [3] empirically demonstrated strong linear correlations between in-distribution (ID) and out-of-distribution (OOD) accuracy and agreement. These trends, coined accuracy-on-the-line (ACL) and agreement-on-the-line (AGL), enable OOD model selection and performance estimation without labeled data. However, these phenomena also break down for certain shifts, such as CIFAR10-C Gaussian Noise, posing a critical bottleneck. In this paper, we make a key finding that recent test-time adaptation (TTA) methods not only improve OOD performance, but drastically strengthen the ACL and AGL trends in models, even on shifts where models previously showed very weak correlations. To analyze this, we revisit the theoretical conditions from Miller et al. [32] that outline the types of distribution shifts needed for perfect ACL in linear models. Surprisingly, these conditions are satisfied after applying TTA to deep models in the penultimate feature embedding space. In particular, TTA collapses complex distribution shifts into ones that can be expressed by a single scaling variable in the feature space. Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts. This lends us a simple system for selecting the best hyperparameters and adaptation strategy without any OOD labeled data. Code is available at https://github.com/EungyeupKim/TTALine.
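The AGL-based estimation the abstract refers to can be illustrated with a short sketch. The idea, following Baek et al. [3], is that pairwise model agreement is computable without OOD labels, and under AGL the ID-vs-OOD agreement points fall (after a probit transform) on roughly the same line as the ID-vs-OOD accuracy points; fitting the line on agreements therefore lets one map each model's ID accuracy to an estimated OOD accuracy. Below is a minimal sketch in that spirit, not the paper's implementation; the function names and array layout are illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def probit(p, eps=1e-6):
    # Probit transform; ACL/AGL trends are linear on this scale.
    return norm.ppf(np.clip(p, eps, 1 - eps))

def estimate_ood_accuracy(preds_id, preds_ood, labels_id):
    """AGL-based OOD accuracy estimation (sketch, after Baek et al. [3]).

    preds_id:  (n_models, n_id_samples)  hard predictions on ID test data
    preds_ood: (n_models, n_ood_samples) hard predictions on OOD data (no labels!)
    labels_id: (n_id_samples,)           ID test labels
    """
    # Pairwise agreement rates on ID and OOD data.
    pairs = list(combinations(range(len(preds_id)), 2))
    agr_id = np.array([(preds_id[i] == preds_id[j]).mean() for i, j in pairs])
    agr_ood = np.array([(preds_ood[i] == preds_ood[j]).mean() for i, j in pairs])

    # Fit the agreement line in probit space:
    # probit(agr_ood) ~ a * probit(agr_id) + b.
    a, b = np.polyfit(probit(agr_id), probit(agr_ood), deg=1)

    # Under AGL, accuracies follow approximately the same line, so each
    # model's ID accuracy maps to a predicted OOD accuracy, label-free.
    acc_id = np.array([(p == labels_id).mean() for p in preds_id])
    return norm.cdf(a * probit(acc_id) + b)
```

The paper's central observation is that applying TTA to each model before computing these quantities makes the fitted line much tighter, so this style of estimator becomes accurate even on shifts (e.g., CIFAR10-C Gaussian Noise) where it previously failed.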
Researcher Affiliation | Collaboration | Eungyeup Kim1, Mingjie Sun1, Christina Baek1, Aditi Raghunathan1, J. Zico Kolter1,2; 1Carnegie Mellon University, 2Bosch Center for AI; {eungyeuk, mingjies, kbaek, raditi, zkolter}@cs.cmu.edu
Pseudocode | Yes | Algorithm 1: Online Test in ID and OOD during TTA
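The paper's Algorithm 1 gives the exact procedure; as an illustration of the general shape of such a loop, here is a minimal TENT-style sketch (an assumption on our part, not the paper's code): adapt on each unlabeled test batch by entropy minimization over BatchNorm affine parameters, recording accuracy online. Running it once on the ID stream and once on the OOD stream yields the paired accuracies from which ACL/AGL plots are built.

```python
import torch
import torch.nn as nn

def collect_bn_params(model):
    # TENT-style setup: freeze everything except BatchNorm affine
    # parameters, and let BN normalize with current-batch statistics.
    model.train()
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.affine:
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

def online_test_during_tta(model, loader, lr=1e-3):
    # Adapt on each unlabeled batch, scoring predictions as we go.
    params = collect_bn_params(model)
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    correct = total = 0
    for x, y in loader:
        logits = model(x)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```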
Open Source Code | Yes | Code is available at https://github.com/EungyeupKim/TTALine.
Open Datasets | Yes | Our testbed includes diverse shifts, including common corruptions (15 corruption shifts in CIFAR10-C, CIFAR100-C, and ImageNet-C [14]), dataset reproductions (CIFAR10.1 [38], ImageNetV2 [39]), and real-world shifts (ImageNet-R [16], Camelyon17-WILDS, iWildCam-WILDS, FMoW-WILDS [41]).
Dataset Splits | Yes | Camelyon17-WILDS [41] contains medical images of tissue collected from different hospitals, where the distribution shift arises from training and testing the model on data from different hospitals. The task is binary classification: predicting whether an image contains tumor tissue. We followed the same evaluation protocol as Miller et al. [32], collecting 30 whole-slide images (WSIs) from 3 hospitals as ID (302,436 images for train and 33,560 for test) and 10 WSIs from a different hospital as OOD (85,054 images).
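For reference, these counts line up with the standard splits exposed by the wilds package for Camelyon17 ('train', 'id_val', and 'test'); a minimal loading sketch under that assumption follows, with the exact transforms and protocol deferred to the paper.

```python
from wilds import get_dataset
import torchvision.transforms as T

# Camelyon17-WILDS: ID = 3 training hospitals, OOD = a held-out hospital.
dataset = get_dataset(dataset="camelyon17", download=True)
transform = T.Compose([T.ToTensor()])

train_set = dataset.get_subset("train", transform=transform)     # ID train (302,436)
id_test_set = dataset.get_subset("id_val", transform=transform)  # ID test  (33,560)
ood_set = dataset.get_subset("test", transform=transform)        # OOD      (85,054)
```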
Hardware Specification | Yes | We trained and tested all models and datasets on an NVIDIA RTX 6000 GPU.
Software Dependencies | No | The paper mentions using the 'torchvision and timm' packages for pretrained model weights and the 'SGD optimizer' or 'SAM optimizer' for training. However, it does not provide version numbers for these software components or for any other key libraries required for replication.
Experiment Setup | Yes | Table 6: The hyperparameter pools used for observing AGL across hyperparameters on the CIFAR10, CIFAR100, ImageNet, Camelyon17-WILDS, and iWildCam-WILDS datasets.
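Putting the pieces together, the selection recipe the paper describes (sweep a pool of TTA hyperparameter configurations, then pick by AGL-estimated OOD accuracy rather than by OOD labels) can be sketched as follows. This is a hypothetical composition of the earlier snippets: estimate_ood_accuracy is the illustrative helper defined above, and all other names are assumptions.

```python
import numpy as np

def select_tta_hyperparams(models_by_config, preds_id_fn, preds_ood_fn, labels_id):
    """Pick the TTA configuration with the highest estimated OOD accuracy.

    models_by_config: dict mapping a config (e.g., a learning rate from the
                      pool in Table 6) to a list of models adapted under it;
                      preds_id_fn / preds_ood_fn return hard predictions.
    """
    best_cfg, best_est = None, -np.inf
    for cfg, models in models_by_config.items():
        preds_id = np.stack([preds_id_fn(m) for m in models])
        preds_ood = np.stack([preds_ood_fn(m) for m in models])
        # Mean estimated OOD accuracy across the adapted models; no OOD labels used.
        est = estimate_ood_accuracy(preds_id, preds_ood, labels_id).mean()
        if est > best_est:
            best_cfg, best_est = cfg, est
    return best_cfg, best_est
```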