Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Test-Time Adaptation with Binary Feedback

Authors: Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, Sung-Ju Lee

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at https://github.com/taeckyung/BiTTA.
Researcher Affiliation | Academia | Corresponding authors. ¹KAIST, ²UNIST. Correspondence to: Taesik Gong <EMAIL>, Sung-Ju Lee <EMAIL>.
Pseudocode | Yes | Algorithm 1: BiTTA Algorithm
Open Source Code | Yes | The source code is available at https://github.com/taeckyung/BiTTA.
Open Datasets | Yes | To evaluate the robustness of BiTTA across various domain shifts, we used the standard image-corruption datasets CIFAR10-C, CIFAR100-C, and Tiny-ImageNet-C (Hendrycks & Dietterich, 2019). Additionally, we conducted experiments on the PACS dataset (Li et al., 2017), which is commonly used for domain adaptation tasks.
Dataset Splits | No | The paper does not provide explicit percentages, sample counts, or citations for predefined training/validation/test splits in the main text. It mentions using "standard image corruption datasets" and describes how the source model was trained, but gives no split specifications sufficient for reproducibility.
Hardware Specification | Yes | The experiments were mainly conducted on NVIDIA RTX 3090 and TITAN GPUs.
Software Dependencies | No | The paper refers to TorchVision and PyTorch implicitly through citations but does not provide specific version numbers for these or other software dependencies (e.g., Python or CUDA), which a reproducible description of ancillary software would require.
Experiment Setup | Yes | We configured BiTTA to operate with minimal labeling effort, using only three binary-feedback samples within each 64-sample test batch, i.e., less than 5%. We use a single setting of the balancing hyperparameters, α = 2 and β = 1, for BiTTA in all experiments. For CIFAR10-C/CIFAR100-C/Tiny-ImageNet-C, we trained the model on the source data with a learning rate of 0.1/0.1/0.001 and a momentum of 0.9, with cosine-annealing learning-rate scheduling for 200 epochs. For PACS, we fine-tuned ImageNet pre-trained weights on the selected source domains for 3,000 iterations using the Adam optimizer with a learning rate of 0.0001. During adaptation, we update all parameters, including BN statistics, with an SGD optimizer with a learning rate/epochs of 0.001/3 (PACS), 0.0001/3 (CIFAR10-C, CIFAR100-C), and 0.00005/5 (Tiny-ImageNet-C) on the entire model. We applied a weight decay of 0.05 for PACS and 0.0 otherwise. ... With 4 dropout instances, we apply a dropout rate of 0.3 for small-scale datasets (e.g., CIFAR10-C, CIFAR100-C, PACS) and 0.1 for large-scale datasets (e.g., Tiny-ImageNet-C, ImageNet-C).
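As a sanity check on the setup quoted above, the labeling budget (3 binary-feedback samples per 64-sample batch) and the per-dataset adaptation hyperparameters can be written down as a small lookup table. This is an illustrative sketch, not the authors' code: the names `ADAPTATION_CONFIG` and `labeling_ratio` are invented here, and only the numeric values come from the quoted setup.

```python
# Illustrative sketch of the reported BiTTA adaptation settings.
# Values are transcribed from the paper's quoted setup; names are hypothetical.

FEEDBACK_PER_BATCH = 3   # binary-feedback samples per test batch
BATCH_SIZE = 64          # test batch size

# Per-dataset SGD settings for the adaptation phase, as stated in the setup.
ADAPTATION_CONFIG = {
    "PACS":            {"lr": 0.001,   "epochs": 3, "weight_decay": 0.05, "dropout": 0.3},
    "CIFAR10-C":       {"lr": 0.0001,  "epochs": 3, "weight_decay": 0.0,  "dropout": 0.3},
    "CIFAR100-C":      {"lr": 0.0001,  "epochs": 3, "weight_decay": 0.0,  "dropout": 0.3},
    "Tiny-ImageNet-C": {"lr": 0.00005, "epochs": 5, "weight_decay": 0.0,  "dropout": 0.1},
}

def labeling_ratio(feedback: int = FEEDBACK_PER_BATCH, batch: int = BATCH_SIZE) -> float:
    """Fraction of each test batch that receives binary feedback."""
    return feedback / batch

# 3 / 64 ≈ 0.0469, consistent with the "less than 5%" claim.
assert labeling_ratio() < 0.05
```

The check confirms the arithmetic behind the "less than 5%" labeling-effort claim; the config dict is just a convenient way to see at a glance how the adaptation hyperparameters differ across datasets.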