Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Test Time Adaptation via Conjugate Pseudo-labels

Authors: Sachin Goyal, Mingjie Sun, Aditi Raghunathan, J. Zico Kolter

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, our approach consistently dominates other TTA alternatives over a wide range of domain adaptation benchmarks. Our approach is particularly of interest when applied to classifiers trained with novel loss functions, e.g., the recently-proposed Poly Loss [25] function, where it differs substantially from (and outperforms) an entropy-based loss. Further, we show that our conjugate-based approach can also be interpreted as a kind of self-training using a very specific soft label, which we refer to as the conjugate pseudo-label. Overall, our method provides a broad framework for better understanding and improving test-time adaptation. Code is available at https://github.com/locuslab/tta_conjugate.
Researcher Affiliation Collaboration Sachin Goyal (1), Mingjie Sun (1), Aditi Raghunathan (1), Zico Kolter (1,2); 1: Carnegie Mellon University, 2: Bosch Center for AI
Pseudocode Yes The full procedure for test time adaptation via conjugate pseudo-labels is shown in Algorithm 1. (Algorithm 1 is presented on page 6).
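Algorithm 1 itself is not reproduced on this page. As a rough illustration only (not the authors' code), for a classifier trained with softmax cross-entropy the conjugate pseudo-label takes the form of a temperature-scaled softmax of the model's own logits, and adaptation minimizes cross-entropy against that (held-fixed) soft label. A minimal NumPy sketch of that special case, where `conjugate_pl_loss` and `T` are illustrative names, not from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conjugate_pl_loss(logits, T=2.0):
    """Self-training loss with temperature-scaled softmax pseudo-labels.

    Sketch of the cross-entropy special case: the pseudo-label
    q = softmax(logits / T) is treated as a fixed (detached) target,
    and the loss is the cross-entropy of the prediction against it.
    """
    q = softmax(logits / T)   # conjugate pseudo-label (no gradient flows here)
    p = softmax(logits)       # model prediction on the test batch
    return -np.mean(np.sum(q * np.log(p + 1e-12), axis=-1))

# Toy batch of logits for two test inputs over three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])
loss = conjugate_pl_loss(logits, T=2.0)
```

With T = 1 the pseudo-label coincides with the model's own softmax, so the loss reduces to the prediction's entropy, recovering the entropy-minimization view of TTA mentioned in the abstract.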
Open Source Code Yes Code is available at https://github.com/locuslab/tta_conjugate.
Open Datasets Yes We evaluate on the three common corruption benchmarks: adapting a classifier trained on CIFAR-10 to CIFAR-10-C, CIFAR-100 to CIFAR-100-C and ImageNet to ImageNet-C [15]. ... We also evaluate on three domain adaptation datasets: adapting a classifier trained on SVHN to MNIST, an ImageNet classifier to ImageNet-R [16] and adapting from synthetic to real data in VISDA-C [38].
Dataset Splits Yes We tune the learning rate (LR) and temperature (T) on the validation noises in the corruption benchmark by grid-search. LR is selected from {1e-1, 1e-2, ..., 1e-4} and T from {1, 2, ..., 5}.
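The grid described above (LR from {1e-1, ..., 1e-4}, T from {1, 2, ..., 5}) can be enumerated with a plain Cartesian product; `evaluate` below is a hypothetical stand-in for running adaptation on the validation corruptions, not anything from the paper:

```python
from itertools import product

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]
temperatures = [1, 2, 3, 4, 5]

def evaluate(lr, T):
    # Hypothetical placeholder: in practice this would run TTA on the
    # validation noises and return mean error. Dummy score for the sketch.
    return abs(lr - 1e-3) + abs(T - 2)

# Pick the (lr, T) pair with the lowest validation score over all 20 combos.
best = min(product(learning_rates, temperatures),
           key=lambda cfg: evaluate(*cfg))
```

This is ordinary exhaustive grid search: 4 x 5 = 20 configurations, each scored once on the held-out validation corruptions.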
Hardware Specification Yes All the experiments have been performed on A6000 GPUs.
Software Dependencies No The paper does not provide specific version numbers for software dependencies (e.g., libraries like PyTorch, TensorFlow, or specific Python versions).
Experiment Setup Yes We tune the learning rate (LR) and temperature (T) on the validation noises in the corruption benchmark by grid-search. LR is selected from {1e-1, 1e-2, ..., 1e-4} and T from {1, 2, ..., 5}. ... Following [50] and [40], we fine-tune by updating the learnable scale and shift parameters of the batch normalization layers across all adaptation losses. For each batch, batch normalization statistics are also updated, as suggested in [41].
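The batch-norm handling described above (fine-tuning only the learnable scale/shift while refreshing statistics from each test batch) amounts to normalizing with per-batch statistics and folding them into running estimates via an exponential moving average. A NumPy sketch of that bookkeeping; the momentum value and class name are illustrative assumptions, not from the paper:

```python
import numpy as np

class BatchNormState:
    """Minimal batch-norm bookkeeping: gamma/beta are the learnable
    scale and shift that TTA fine-tunes; running statistics are
    refreshed from each incoming test batch."""

    def __init__(self, dim, momentum=0.1):
        self.gamma = np.ones(dim)        # learnable scale (updated by the TTA loss)
        self.beta = np.zeros(dim)        # learnable shift (updated by the TTA loss)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum

    def forward(self, x):
        # Normalize with per-batch statistics, and fold them into the
        # running estimates (exponential moving average).
        mean, var = x.mean(axis=0), x.var(axis=0)
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mean
        self.running_var = (1 - m) * self.running_var + m * var
        x_hat = (x - mean) / np.sqrt(var + 1e-5)
        return self.gamma * x_hat + self.beta

bn = BatchNormState(dim=3)
out = bn.forward(np.random.randn(8, 3))
```

In a real framework the same effect is obtained by keeping batch-norm layers in training mode (so statistics update per batch) while restricting the optimizer to the batch-norm affine parameters.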