CoPL: Contextual Prompt Learning for Vision-Language Understanding

Authors: Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive set of experiments on a variety of standard and few-shot datasets shows that our method produces substantially improved performance compared to current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features. We conduct a comprehensive set of experiments on visual classification on 11 different datasets and scenarios (zero-shot, one-shot, seen/unseen, and within-dataset and cross-dataset).
Researcher Affiliation | Industry | Adobe Research, Bangalore, India (koustavag@adobe.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | We follow Zhou et al. (2022b) to evaluate our model on 11 image classification datasets of varying complexity. The datasets include: generic classification datasets like ImageNet (Deng et al. 2009b) and Caltech101 (Fei-Fei, Fergus, and Perona 2004); curated fine-grained datasets like Oxford Pets (Parkhi et al. 2012), Stanford Cars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014), and FGVCAircraft (Maji et al. 2013); and scene, action, texture, and satellite image recognition datasets from SUN397 (Xiao et al. 2010), UCF101 (Soomro, Zamir, and Shah 2012), DTD (Cimpoi et al. 2014), and EuroSAT (Helber et al. 2018), respectively.
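Several of these benchmarks ship with torchvision, which makes the data setup straightforward to reproduce. Below is a minimal loading sketch; the paper does not name its data-loading stack, so the use of torchvision, the `"data"` root path, and the choice of three example datasets are all illustrative. The normalization constants are CLIP's published values.

```python
from torchvision import datasets, transforms

# CLIP-style preprocessing: 224x224 center crop with CLIP's published
# normalization constants.
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

# Three of the 11 benchmarks as packaged by torchvision; the remaining
# datasets (e.g., ImageNet, UCF101) require separate preparation.
food = datasets.Food101("data", split="train", download=True, transform=preprocess)
dtd = datasets.DTD("data", split="train", download=True, transform=preprocess)
eurosat = datasets.EuroSAT("data", download=True, transform=preprocess)
```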
Dataset Splits | Yes | For the few-shot experiments, we follow Zhou et al. (2022a) to randomly sample datapoints for training and evaluate on the entire test set. While training is conducted only on the base classes, during testing we transfer the learnt knowledge to classify unseen as well as seen classes. Here, we train CoPL and CoCoOp with 1 training instance per class for each image recognition task and test the accuracy on both seen and unseen classes within the dataset.
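A minimal sketch of this few-shot sampling and base/novel protocol, assuming a label-indexed dataset of (image, label) pairs; the helper names are hypothetical, and the even split of classes into base and novel halves follows the Zhou et al. (2022a) base-to-new convention rather than anything stated in the quote above.

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, shots=1, seed=0):
    """Randomly draw `shots` training instances per class; evaluation
    still uses the entire test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append((image, label))
    train = []
    for items in by_class.values():
        train.extend(rng.sample(items, shots))
    return train

def base_novel_split(class_names):
    """Split classes in half: train only on the base (seen) half, then
    test on both halves to measure transfer to unseen classes."""
    mid = len(class_names) // 2
    return class_names[:mid], class_names[mid:]
```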
Hardware Specification | Yes | All our models are trained with a batch size of 1 for 10 epochs on a single 16 GB Tesla T4 GPU.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | Our prompt token length is 4. All our models are trained with a batch size of 1 for 10 epochs on a single 16 GB Tesla T4 GPU. Our starting learning rate is 0.002 with a cosine learning-rate scheduler; warm-up uses a constant learning rate of 0.00001.
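Read as a training configuration, these numbers translate into something like the following PyTorch sketch. The optimizer choice (SGD), the one-epoch warm-up length, and the 512-dimensional prompt embedding are assumptions borrowed from common CLIP prompt-learning setups, not stated in the quoted text.

```python
import math
import torch

# Values quoted in the paper.
EPOCHS, BASE_LR, WARMUP_LR, PROMPT_LEN = 10, 0.002, 1e-5, 4

# Placeholder for the learnable prompt vectors; 512 is CLIP's text
# embedding width (an assumption, not given in the quote).
prompt_params = [torch.nn.Parameter(torch.zeros(PROMPT_LEN, 512))]
optimizer = torch.optim.SGD(prompt_params, lr=BASE_LR)

def lr_lambda(epoch):
    if epoch < 1:                       # constant warm-up (length assumed: 1 epoch)
        return WARMUP_LR / BASE_LR
    # Cosine decay over the remaining epochs.
    progress = (epoch - 1) / max(EPOCHS - 1, 1)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    # ... one training pass with batch size 1 goes here ...
    scheduler.step()
```

Using a `LambdaLR` keeps the constant warm-up and the cosine decay in one schedule, which avoids the bookkeeping of manually resetting learning rates between phases.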