Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Test-Time Adaptation with Binary Feedback

Authors: Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, Sung-Ju Lee

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at https://github.com/taeckyung/BiTTA.
Researcher Affiliation | Academia | Corresponding authors. ¹KAIST, ²UNIST. Correspondence to: Taesik Gong <EMAIL>, Sung-Ju Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1: BiTTA Algorithm
Open Source Code | Yes | The source code is available at https://github.com/taeckyung/BiTTA.
Open Datasets | Yes | To evaluate the robustness of BiTTA across various domain shifts, we used the standard image-corruption datasets CIFAR10-C, CIFAR100-C, and Tiny-ImageNet-C (Hendrycks & Dietterich, 2019). Additionally, we conducted experiments on the PACS dataset (Li et al., 2017), which is commonly used for domain adaptation tasks.
Dataset Splits | No | The paper does not provide explicit percentages, sample counts, or citations for predefined training/validation/test splits in the main text. It mentions using "standard image corruption datasets" and describes how the source model was trained, but gives no split specifications sufficient for reproducibility.
Hardware Specification | Yes | The experiments were mainly conducted on NVIDIA RTX 3090 and TITAN GPUs.
Software Dependencies | No | The paper refers to TorchVision and PyTorch implicitly through citations but does not provide specific version numbers for these or other software dependencies (e.g., Python or CUDA), which a reproducible description of ancillary software would require.
Experiment Setup | Yes | We configured BiTTA to operate with minimal labeling effort, using only three binary-feedback samples within each 64-sample test batch, i.e., less than 5%. We use a single setting of the balancing hyperparameters, α = 2 and β = 1, for BiTTA in all experiments. For CIFAR10-C/CIFAR100-C/Tiny-ImageNet-C, we trained the model on the source data with a learning rate of 0.1/0.1/0.001 and a momentum of 0.9, with cosine-annealing learning-rate scheduling for 200 epochs. For PACS, we fine-tuned ImageNet pre-trained weights on the selected source domains for 3,000 iterations using the Adam optimizer with a learning rate of 0.0001. During adaptation, we update all parameters, including BN statistics, with an SGD optimizer with a learning rate/epochs of 0.001/3 (PACS), 0.0001/3 (CIFAR10-C, CIFAR100-C), and 0.00005/5 (Tiny-ImageNet-C) on the entire model. We applied a weight decay of 0.05 for PACS and 0.0 otherwise. ... With 4 dropout instances, we apply a dropout rate of 0.3 for small-scale datasets (e.g., CIFAR10-C, CIFAR100-C, PACS) and 0.1 for large-scale datasets (e.g., Tiny-ImageNet-C, ImageNet-C).
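As a sanity check on the setup quoted above, the labeling budget (3 binary-feedback samples per 64-sample batch) and the per-dataset adaptation hyperparameters can be written down as a small lookup table. This is an illustrative sketch, not the authors' code: the names `ADAPTATION_CONFIG` and `labeling_ratio` are invented here, and only the numeric values come from the quoted setup.

```python
# Illustrative sketch of the reported BiTTA adaptation settings.
# Values are transcribed from the paper's quoted setup; names are hypothetical.

FEEDBACK_PER_BATCH = 3   # binary-feedback samples per test batch
BATCH_SIZE = 64          # test batch size

# Per-dataset SGD settings for the adaptation phase, as stated in the setup.
ADAPTATION_CONFIG = {
    "PACS":            {"lr": 0.001,   "epochs": 3, "weight_decay": 0.05, "dropout": 0.3},
    "CIFAR10-C":       {"lr": 0.0001,  "epochs": 3, "weight_decay": 0.0,  "dropout": 0.3},
    "CIFAR100-C":      {"lr": 0.0001,  "epochs": 3, "weight_decay": 0.0,  "dropout": 0.3},
    "Tiny-ImageNet-C": {"lr": 0.00005, "epochs": 5, "weight_decay": 0.0,  "dropout": 0.1},
}

def labeling_ratio(feedback: int = FEEDBACK_PER_BATCH, batch: int = BATCH_SIZE) -> float:
    """Fraction of each test batch that receives binary feedback."""
    return feedback / batch

# 3 / 64 ≈ 0.0469, consistent with the "less than 5%" claim.
assert labeling_ratio() < 0.05
```

The check confirms the arithmetic behind the "less than 5%" labeling-effort claim; the config dict is just a convenient way to see at a glance how the adaptation hyperparameters differ across datasets.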