Robust Test-Time Adaptation for Zero-Shot Prompt Tuning

Authors: Ding-Chu Zhang, Zhi Zhou, Yu-Feng Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments on several benchmarks demonstrate that ADAPROMPT alleviates model bias, adapts to data bias and mostly outperforms the state-of-the-art methods at a small time cost."
Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Pseudocode | Yes | Algorithm 1: Confidence-aware Buffer. Input: sample xt, pseudo label ŷ(xt), confidence c(xt). Parameter: threshold τ. 1: if c(xt) > τ then 2: if buffer is not full then 3: Add(xt, ŷ(xt), c(xt)) 4: else 5: M ← majority class(es) in buffer 6: if ŷ(xt) ∉ M then 7: Randomly select a class and discard one instance (xi, ŷ(xi), c(xi)) with the lowest confidence in that class, where ŷ(xi) ∈ M 8: Add(xt, ŷ(xt), c(xt)) 9: else 10: c(xj) ← the minimum confidence value in class ŷ(xt) 11: if c(xj) < c(xt) then 12: Discard the instance (xj, ŷ(xj), c(xj)) in buffer 13: Add(xt, ŷ(xt), c(xt)) 14: end if 15: end if 16: end if 17: end if (a Python sketch of this buffer follows the table)
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | "We conduct experiments on two standard benchmarks: CIFAR10-C and CIFAR100-C (Hendrycks and Dietterich 2019)."
Dataset Splits | No | "Different from the previous methods that require training on the training set, we directly update prompts with unlabeled test data and then predict on them." (this protocol is sketched after the table)
Hardware Specification | No | The paper does not mention specific hardware, such as GPU models, CPU models, or cloud computing instance types, for running the experiments.
Software Dependencies | No | The paper mentions models and optimizers (e.g., CLIP, AdamW) but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions).
Experiment Setup | Yes | "For ADAPROMPT, we set 64 as our buffer size and three different hand-crafted prompts for ensembling, which are 'an image of a [CLASS]', 'a colorful image of a [CLASS]' and 'a noisy picture of a [CLASS]'. Moreover, we set the batch size to 64 following previous studies (Boudiaf et al. 2022; Niu et al. 2022). The AdamW optimizer optimizes all the prompts with a learning rate of 0.005. We report mean ± std accuracy over five runs with random seeds set to 0, 1, 2, 3, 4." (a configuration sketch follows the table)
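
The Pseudocode row above fully specifies the buffer's eviction policy, so it translates directly into code. Below is a minimal Python sketch of the confidence-aware buffer, assuming entries are stored as (sample, pseudo-label, confidence) tuples; the class name, method names, and default threshold value are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter

class ConfidenceAwareBuffer:
    """Illustrative sketch of Algorithm 1 (not the authors' code)."""

    def __init__(self, capacity=64, threshold=0.5):
        self.capacity = capacity    # buffer size (64 in the reported setup)
        self.threshold = threshold  # confidence threshold tau (assumed value)
        self.items = []             # entries: (x, pseudo_label, confidence)

    def add(self, x, y_hat, conf):
        if conf <= self.threshold:            # step 1: keep only confident samples
            return
        if len(self.items) < self.capacity:   # steps 2-3: free slot, just add
            self.items.append((x, y_hat, conf))
            return
        counts = Counter(y for _, y, _ in self.items)
        top = max(counts.values())
        majority = [y for y, n in counts.items() if n == top]   # step 5
        if y_hat not in majority:             # steps 6-8: evict from a majority class
            cls = random.choice(majority)     # "randomly select a class"
            victim = min((it for it in self.items if it[1] == cls),
                         key=lambda it: it[2])  # lowest confidence in that class
            self.items.remove(victim)
            self.items.append((x, y_hat, conf))
        else:                                 # steps 9-14: replace weakest same-class entry
            weakest = min((it for it in self.items if it[1] == y_hat),
                          key=lambda it: it[2])
            if weakest[2] < conf:
                self.items.remove(weakest)
                self.items.append((x, y_hat, conf))
```

Evicting the lowest-confidence instance from a majority class keeps the buffer approximately class-balanced while retaining high-confidence samples.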
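
The Dataset Splits row reflects the test-time adaptation protocol: there is no train/validation/test split because prompts are updated directly on unlabeled test batches, and predictions are then made on those same batches. The self-contained sketch below illustrates that loop; the toy linear scorer and the entropy-minimization objective are stand-in assumptions for illustration, not ADAPROMPT's actual model or loss.

```python
import torch

torch.manual_seed(0)
num_classes, feat_dim = 10, 32

# Learnable "prompt" parameters standing in for CLIP text prompts.
prompts = torch.nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
optimizer = torch.optim.AdamW([prompts], lr=0.005)  # AdamW, lr 0.005 as reported

# A stream of unlabeled test batches (random features as stand-ins).
test_stream = [torch.randn(64, feat_dim) for _ in range(5)]

for x in test_stream:
    logits = x @ prompts.t()                 # class scores under current prompts
    probs = logits.softmax(dim=-1)
    loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()                          # adapt on the unlabeled batch...
    optimizer.step()
    with torch.no_grad():
        preds = (x @ prompts.t()).argmax(dim=-1)  # ...then predict on it
```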
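
Finally, the Experiment Setup row gathers into a small configuration block. The sketch below restates the reported hyperparameters; the "{}" placeholder syntax, the soft-prompt tensor shape, and the variable names are illustrative assumptions.

```python
import torch

# Hyperparameters as reported in the Experiment Setup row.
BUFFER_SIZE = 64
BATCH_SIZE = 64          # following Boudiaf et al. 2022; Niu et al. 2022
LEARNING_RATE = 0.005
SEEDS = [0, 1, 2, 3, 4]  # mean ± std accuracy over five runs

# The three hand-crafted templates used for prompt ensembling
# ("{}" as the class-name slot is an assumed placeholder syntax).
TEMPLATES = [
    "an image of a {}",
    "a colorful image of a {}",
    "a noisy picture of a {}",
]

# One learnable soft prompt per template; the 77x512 shape matches CLIP's
# text token length and ViT-B embedding width, but is an assumption here.
prompt_params = [torch.nn.Parameter(torch.zeros(77, 512)) for _ in TEMPLATES]
optimizer = torch.optim.AdamW(prompt_params, lr=LEARNING_RATE)
```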