SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models

Authors: Xiaosong Ma, Jie Zhang, Song Guo, Wenchao Xu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that it achieves state-of-the-art test-time adaptation performance on ImageNet and nine other datasets. It is also shown that SwapPrompt can even achieve comparable performance with supervised prompt adaptation methods. We conduct extensive experiments on ImageNet and its four variants, as well as nine other image classification datasets. The empirical evaluation shows that our approach significantly outperforms current TPT methods and can even compete with supervised prompt adaptation methods on most datasets.
Researcher Affiliation | Academia | Xiaosong Ma, Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China (xiaosong16.ma@connect.polyu.hk); Jie Zhang, Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China (jie-comp.zhang@polyu.edu.hk); Song Guo, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China (songguo@cse.ust.hk); Wenchao Xu, Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China (wenchao.xu@polyu.edu.hk)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Figure 2 shows a framework but not an algorithm.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | Dataset. We evaluate the proposed SwapPrompt over fourteen datasets, including ImageNet [25] and its four variants: ImageNet-V2 [26], ImageNet-A [27], ImageNet-R [28] and ImageNet-Sketch [29], and nine other publicly available image classification datasets used in CLIP: Caltech101 [30], DTD [31], Flowers102 [32], Oxford-Pets [33], UCF101 [34], Stanford Cars [35], Food101 [36], EuroSAT [37] and SUN397 [38].
Dataset Splits | No | The paper evaluates on test data and discusses pseudo-labeling on test data for adaptation, but does not explicitly specify traditional training, validation, or test dataset splits. It states: "We only use the test data to do adaptation and also evaluate models with them."
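The pseudo-labeling step mentioned here (adapting on unlabeled test data by trusting only the most confident predictions) can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name `select_confident` and the top-k API are assumptions:

```python
import torch
import torch.nn.functional as F

def select_confident(logits: torch.Tensor, k: int = 16):
    """Keep the k test samples whose softmax confidence is highest,
    and use their argmax class as the pseudo-label."""
    probs = F.softmax(logits, dim=-1)                 # (N, num_classes)
    conf, pseudo_labels = probs.max(dim=-1)           # per-sample confidence
    top = conf.topk(min(k, conf.numel())).indices     # indices of top-k samples
    return top, pseudo_labels[top]

# toy usage: 8 unlabeled test samples, 3 classes
logits = torch.randn(8, 3)
idx, labels = select_confident(logits, k=4)
assert idx.shape == (4,) and labels.shape == (4,)
```

The paper's reported setting (selecting the top 16 highest-confidence test samples, see the Experiment Setup row) corresponds to `k=16`.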
Hardware Specification | Yes | We do all experiments on a workstation with an RTX 3090 GPU, a 3.5-GHz Intel Core i9-11900K CPU and 64GB of RAM.
Software Dependencies | No | The paper mentions using the "publicly available CLIP model with the ResNet-50 visual encoder" and "SGD optimizer" but does not provide specific version numbers for these or other software dependencies like Python or PyTorch.
Experiment Setup | Yes | Unless otherwise specified, the prompt is initialized randomly with 4 learnable tokens in SwapPrompt, UPL and CoOp. For TPT, the prompt is initialized as the default "a photo of a". When comparing performance with baselines, we select the top 16 test data with the highest confidence to train SwapPrompt and UPL. For SwapPrompt, the decay rate of the target prompt is 0.99, and both α and β are 1. We use the same image augmentation method as SimCLR [40] to generate two different augmented images for an image. We optimize the prompts for 50 epochs with the SGD optimizer and a cosine decay learning rate scheduler; the initial learning rate is 0.002. The batch size of images is 32 on all datasets.
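The reported hyperparameters (4 learnable prompt tokens, SGD at an initial learning rate of 0.002 with cosine decay over 50 epochs, and an exponential-moving-average "target" prompt with decay 0.99) can be wired together roughly as below. This is a minimal sketch under assumed tensor shapes and variable names, with a placeholder loss standing in for SwapPrompt's actual objective:

```python
import torch

# Assumed shapes: 4 learnable tokens in a 512-dim embedding space.
embed_dim, n_tokens, epochs, base_lr, ema_decay = 512, 4, 50, 0.002, 0.99

online_prompt = torch.randn(n_tokens, embed_dim, requires_grad=True)
target_prompt = online_prompt.detach().clone()   # EMA copy, not trained directly

optimizer = torch.optim.SGD([online_prompt], lr=base_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... in the paper: loss on the top-16-confidence test batch (batch size 32) ...
    loss = online_prompt.pow(2).mean()           # placeholder loss for the sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # cosine decay of the learning rate
    with torch.no_grad():                        # EMA update of the target prompt
        target_prompt.mul_(ema_decay).add_(online_prompt, alpha=1 - ema_decay)
```

After `T_max` scheduler steps the learning rate has annealed to (near) zero, matching the 50-epoch cosine schedule described above.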