Learning to Learn Better Visual Prompts
Authors: Fengxiang Wang, Wanrong Huang, Shaowu Yang, Qi Fan, Long Lan
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets validate the efficacy of our meta-learning-informed prompt tuning, affirming its role as a robust optimization strategy for VLMs. We performed comprehensive experiments on the generalization capability from base classes to new classes on 7 image classification datasets. Evaluations reveal that our presented meta-learning-based prompt tuning method is effective, obtaining higher final performance on new classes than existing approaches. |
| Researcher Affiliation | Academia | Fengxiang Wang1, Wanrong Huang1, Shaowu Yang1*, Fan Qi2, Long Lan1 1HPCL, College of Computer Science and Technology, National University of Defense Technology 2The Hong Kong University of Science and Technology {wfx23, huangwanrong12, shaowu.yang}@nudt.edu.cn, fanqics@gmail.com, long.lan@nudt.edu.cn |
| Pseudocode | Yes | Algorithm 1: Meta-Learning Stage. Require: p(ζ), a distribution over tasks; α, β, step-size hyperparameters; M, the model after the prompt-tuning stage. 1: initialize θ with M; 2: while not done do; 3: sample a batch of tasks ζᵢ ∼ p(ζ); 4: for all ζᵢ do; 5: evaluate ∇θ φζᵢ(Mθ) with respect to K examples; 6: compute adapted parameters with gradient descent: θ′ᵢ = θ − α∇θ φζᵢ(Mθ); 7: end for; 8: update θ ← θ − β∇θ Σζᵢ∼p(ζ) φζᵢ(Mθ′ᵢ); 9: end while. (A runnable sketch of this loop is given below the table.) |
| Open Source Code | No | Not found. The paper does not explicitly state that the source code for their method is open-source or provide a link to it. |
| Open Datasets | Yes | We evaluate the methods on 7 image classification datasets, which cover a diverse set of recognition tasks. Specifically, the benchmark includes ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004) for classification on generic objects; Flowers102 (Nilsback and Zisserman 2008) and FGVC Aircraft (Maji et al. 2013) for fine-grained classification; SUN397 (Xiao et al. 2010) for scene recognition; UCF101 (Soomro, Zamir, and Shah 2012) for action recognition; DTD (Cimpoi et al. 2014) for texture classification. |
| Dataset Splits | Yes | Likewise, for a fair comparison, we divided the dataset into training, validation, and testing sets at a ratio of 2:1:1 in our methodology. The test data is the same. |
| Hardware Specification | No | Not found. The paper mentions 'backbones (ViT-B/16 and ResNet-50)', which refers to model architectures, not specific hardware specifications like GPU or CPU models. |
| Software Dependencies | No | Not found. The paper states its implementation is based on CoOp, KgCoOp, and CLIP, but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Training Details. Our model's implementation builds on CoOp and KgCoOp, together with the CLIP model. For the task of generalization from base to new classes in CoOp, CoCoOp, and KgCoOp, we split the dataset categories into a train-test set at a ratio of 3:1. Likewise, for a fair comparison, we divided the dataset into training, validation, and testing sets at a ratio of 2:1:1 in our methodology; the test data is the same. We conduct the experiments with the ResNet-50 (He et al. 2016) and ViT-B/16 (Dosovitskiy et al. 2020) vision backbones. Following CoOp, we fix the context length at 4 and do not initialize the context vectors; the class token position is "end". No data augmentation is used in our method. We use a 5-way K-shot setting in meta-training, with K = 1, 2, 4, 8, 16 (see the episode-sampling sketch below the table). |
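
The pseudocode in the table follows a MAML-style inner/outer loop over the prompt parameters. Below is a minimal PyTorch sketch of that loop, assuming the context vectors are the only trainable parameters; `sample_tasks`, `task_loss`, and the support/query split of each task are hypothetical placeholders rather than the authors' code, which is not publicly available.

```python
import torch

def meta_learning_stage(model, prompt_params, sample_tasks, task_loss,
                        alpha=1e-2, beta=1e-3, num_iters=1000):
    """Minimal MAML-style sketch of Algorithm 1 over prompt parameters.

    model         - CLIP-style model after the prompt-tuning stage (backbone frozen)
    prompt_params - learnable context vectors theta, initialized from that stage
    sample_tasks  - callable returning a batch of (support, query) tasks
    task_loss     - callable(model, theta, examples) -> scalar loss
    alpha, beta   - inner / outer step sizes from Algorithm 1
    """
    theta = prompt_params.clone().detach().requires_grad_(True)
    outer_opt = torch.optim.SGD([theta], lr=beta)

    for _ in range(num_iters):
        outer_opt.zero_grad()
        meta_loss = 0.0
        for support, query in sample_tasks():             # zeta_i ~ p(zeta)
            # Inner step: adapt the prompts on the K support examples.
            inner_loss = task_loss(model, theta, support)
            grad, = torch.autograd.grad(inner_loss, theta, create_graph=True)
            theta_i = theta - alpha * grad                 # theta'_i = theta - alpha * grad
            # Outer objective: loss of the adapted prompts on held-out examples.
            meta_loss = meta_loss + task_loss(model, theta_i, query)
        meta_loss.backward()                               # accumulates the sum over tasks
        outer_opt.step()                                   # theta <- theta - beta * grad
    return theta.detach()
```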
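
For the 5-way K-shot meta-training episodes mentioned in the setup, a task could be drawn as in the sketch below. This is only an illustration under assumed names: `dataset_by_class` (a mapping from class id to a list of examples) and the query-set size `n_query` are not specified by the paper.

```python
import random

def sample_episode(dataset_by_class, n_way=5, k_shot=16, n_query=15):
    """Draw one N-way K-shot task: a support set for adaptation and a query set."""
    classes = random.sample(list(dataset_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(dataset_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```

Combining the two sketches, `sample_tasks` would simply return a list of `sample_episode(...)` results at each meta-iteration.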