Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Test-Time Adaptation with Binary Feedback
Authors: Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, Sung-Ju Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at https://github.com/taeckyung/BiTTA. |
| Researcher Affiliation | Academia | Corresponding authors. ¹KAIST, ²UNIST. Correspondence to: Taesik Gong <EMAIL>, Sung-Ju Lee <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: BiTTA Algorithm |
| Open Source Code | Yes | The source code is available at https://github.com/taeckyung/BiTTA. |
| Open Datasets | Yes | To evaluate the robustness of BiTTA across various domain shifts, we used standard image corruption datasets CIFAR10-C, CIFAR100-C, and Tiny-ImageNet-C (Hendrycks & Dietterich, 2019). Additionally, we conducted experiments on the PACS dataset (Li et al., 2017), which is commonly used for domain adaptation tasks. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages, sample counts, or citations for predefined training/validation/test splits within the main text. It mentions using 'standard image corruption datasets' and details about training the source model, but without explicit split specifications for reproducibility. |
| Hardware Specification | Yes | The experiments were mainly conducted on NVIDIA RTX 3090 and TITAN GPUs. |
| Software Dependencies | No | The paper mentions 'TorchVision' and 'PyTorch' implicitly through citations but does not provide specific version numbers for these or other software libraries like Python or CUDA, which are required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | We configured BiTTA to operate with minimal labeling effort, using only three binary feedback samples within each 64-sample test batch, accounting for less than 5%. We utilize a single value of balancing hyperparameters α = 2 and β = 1 for BiTTA in all experiments. For CIFAR10-C/CIFAR100-C/Tiny-ImageNet-C, we trained the model with the source data with a learning rate of 0.1/0.1/0.001 and a momentum of 0.9, with cosine annealing learning rate scheduling for 200 epochs. For PACS, we fine-tuned the pre-trained weights from ImageNet on the selected source domains for 3,000 iterations using the Adam optimizer with a learning rate of 0.0001. During adaptation, we update all parameters, including BN stats, with an SGD optimizer with a learning rate/epoch of 0.001/3 (PACS), 0.0001/3 (CIFAR10-C, CIFAR100-C), and 0.00005/5 (Tiny-ImageNet-C) on the entire model. We applied weight decay of 0.05 to PACS and 0.0 otherwise. ... With 4 dropout instances, we apply a dropout rate of 0.3 for small-scale datasets (e.g., CIFAR10-C, CIFAR100-C, PACS) and 0.1 for large-scale datasets (e.g., Tiny-ImageNet-C, ImageNet-C). |
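The per-dataset adaptation settings quoted in the Experiment Setup row can be collected into a small lookup table. The sketch below is ours, not from the paper's codebase; the function and dictionary names are illustrative, and the values simply restate the quoted hyperparameters (SGD learning rate and epochs during adaptation, dropout rate, and weight decay).

```python
# Hedged sketch: per-dataset adaptation hyperparameters as quoted in the
# Experiment Setup row. Names here are illustrative, not from the BiTTA repo.

ADAPTATION_CONFIG = {
    # dataset: SGD learning rate, adaptation epochs, dropout rate
    "PACS":            {"lr": 0.001,   "epochs": 3, "dropout": 0.3},
    "CIFAR10-C":       {"lr": 0.0001,  "epochs": 3, "dropout": 0.3},
    "CIFAR100-C":      {"lr": 0.0001,  "epochs": 3, "dropout": 0.3},
    "Tiny-ImageNet-C": {"lr": 0.00005, "epochs": 5, "dropout": 0.1},
}

def adaptation_settings(dataset: str) -> dict:
    """Return adaptation hyperparameters for a dataset, per the quoted setup."""
    cfg = dict(ADAPTATION_CONFIG[dataset])
    # Weight decay is 0.05 for PACS and 0.0 otherwise, as stated in the paper.
    cfg["weight_decay"] = 0.05 if dataset == "PACS" else 0.0
    return cfg

print(adaptation_settings("Tiny-ImageNet-C"))
```

A lookup like this makes the "No" verdicts above concrete: the optimizer settings are fully specified per dataset, while library versions and dataset splits are not.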