Few-Shot Adversarial Prompt Learning on Vision-Language Models

Authors: Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We justify our claims through a series of experiments on 11 benchmark datasets covering multiple recognition tasks.
Researcher Affiliation Academia Yiwei Zhou, School of Automation, Beijing Institute of Technology (zhouyiwei@bit.edu.cn); Xiaobo Xia, Sydney AI Centre, University of Sydney (xiaoboxia.uni@gmail.com); Zhiwei Lin, School of Automation, Beijing Institute of Technology (linzhiwei@bit.edu.cn); Bo Han, Department of Computer Science, Hong Kong Baptist University (bhanml@comp.hkbu.edu.hk); Tongliang Liu, Sydney AI Centre, University of Sydney (tongliang.liu@sydney.edu.au)
Pseudocode Yes A. Pipelines of Adversarial Prompt Learning and Testing. For a better understanding of the designed algorithm, we describe our adversarial prompt learning and adversarial prompt testing pipelines in Algorithm 1 and Algorithm 2, respectively.
Open Source Code Yes Code is available at: https://github.com/lionel-w2/FAP.
Open Datasets Yes To evaluate the proposed method, we align with previous works [28, 33] and utilize 11 diverse image recognition datasets that span multiple vision tasks. Specifically, the datasets include two generic object datasets: ImageNet-1K [20] and Caltech101 [32]; a texture recognition dataset: DTD [34]; five fine-grained object recognition datasets: FGVCAircraft [35], OxfordPets [36], Flowers102 [37], Food101 [38], and StanfordCars [39]; a scene recognition dataset: SUN397 [40]; an action recognition dataset: UCF101 [41]; and a satellite image classification dataset: EuroSAT [42].
Dataset Splits No The paper uses a 'test dataset' and a 'few-shot dataset S' for training, but does not explicitly mention a validation set or validation split for hyperparameter tuning or model selection in its experimental setup details. Training is done for a fixed number of epochs.
Hardware Specification Yes Experiments of adversarial prompt tuning on the ImageNet-1K dataset are carried out on a single NVIDIA RTX A40 GPU, while experiments on the other 10 datasets are performed on a single NVIDIA RTX 4090 GPU.
Software Dependencies Yes All experiments are conducted in an environment running PyTorch 1.10.1 and CUDA 11.3 on Python 3.8.
Experiment Setup Yes All models are trained for 5 epochs in cross-dataset evaluation and for 10 epochs in the other benchmark settings, using an SGD optimizer with a momentum of 0.9. The initial learning rate is set to 0.0035. We apply a cosine learning rate scheduler and a warm-up strategy during the first epoch. For adversarial prompt learning, we use token prompts of size 2 in both the vision and text branches across the first 9 transformer blocks. Attacks are generated under an ℓ∞ threat model via a 2-step PGD attack, with a perturbation bound ϵ = 1/255 and a step size α = 1/255, following the methodologies outlined in [11].
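The attack configuration above (2-step ℓ∞ PGD with ϵ = α = 1/255) can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `pgd_attack` and the plain cross-entropy loss are assumptions, and the paper applies the attack to a CLIP-style prompted model rather than a generic classifier.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=1/255, alpha=1/255, steps=2):
    """Minimal l_inf PGD sketch: `steps` gradient-sign ascent updates,
    projected back into the eps-ball around the clean images and clipped
    to the valid [0, 1] pixel range."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)      # attack objective (assumed: plain CE)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()        # ascent step of size alpha
        adv = images + (adv - images).clamp(-eps, eps)  # project into the eps-ball
        adv = adv.clamp(0.0, 1.0)                       # keep pixels in [0, 1]
    return adv.detach()
```

With ϵ = α = 1/255 and 2 steps, each pixel moves at most 1/255 per step but the projection caps the total perturbation at 1/255, matching the stated threat model.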