Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Authors: Cristina Menghini, Andrew Delworth, Stephen Bach
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive exploration of learning scenarios by varying prompt modalities, learning paradigms, and training strategies. We present empirical evidence showcasing the effectiveness of iterative prompt-training strategies that leverage CLIP-based pseudolabels, regardless of learning paradigms and prompt modalities, resulting in significant improvements in CLIP's image classification performance across different settings. We conduct experiments on six tasks where CLIP has been observed to underperform [31], such as satellite-image classification, flower-species identification, and texture-image recognition, among others. |
| Researcher Affiliation | Academia | Cristina Menghini (Brown University, cristina_menghini@brown.edu); Andrew Delworth (Brown University, adelwort@cs.brown.edu); Stephen H. Bach (Brown University, sbach@cs.brown.edu) |
| Pseudocode | No | The paper does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps for a method or procedure formatted like code. |
| Open Source Code | Yes | The code to reproduce the experiments is at BatsResearch/menghini-neurips23-code. |
| Open Datasets | Yes | We conduct the analysis on six tasks, covering specialized and fine-grained domains, where CLIP shows deficiencies [31]. We call this set of tasks FRAMED, and it includes Flowers102 [29], RESISC45 [9], FGVC-Aircraft [26], MNIST [11], EuroSAT [14], and DTD [10]. For each dataset we use the training and test splits provided in [23]. For the transductive zero-shot learning setting we randomly generate three splits of seen and unseen classes with a 62-38 ratio. Further details are in Appendix A.2. We use the datasets gathered by the recent ELEVATER [23] benchmark for vision-language models. |
| Dataset Splits | Yes | For each dataset we use the training and test splits provided in [23]. For the transductive zero-shot learning setting we randomly generate three splits of seen and unseen classes with a 62-38 ratio. The number of iterations I is 10. For training, the batch size is 64. |
| Hardware Specification | No | The paper mentions using specific model architectures like 'ViT-B/32' and 'ViT-L/14' as vision backbones. However, it does not specify the underlying hardware components (e.g., specific GPU models, CPU models, memory details, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'SGD as the optimizer' and discusses deep learning concepts that imply the use of frameworks like PyTorch or TensorFlow. However, it does not provide specific version numbers for any programming languages, libraries, or software environments used to conduct the experiments. |
| Experiment Setup | Yes | For both visual and textual prompt learning, we set the prefix size to 16 [48, 18]. Multimodal prompts have length 8 [44]. We use SGD as the optimizer and train for 150 epochs. We use 5 warmup epochs at a learning rate of 0.0001, and then set the learning rate to l, which is decayed by the cosine annealing rule. For textual and visual prompt learning, l = 0.1, while for multimodal prompt learning, l = 0.01. In SSL, we use 2 labeled samples per class to assess the impact of pseudolabels in the scenario of very few labeled data and abundant unlabeled data. The number of iterations I is 10. FPL and IFPL have the number of pseudolabels per class fixed to 16 since it is indicated as the optimal K in the previous research on pseudolabeling with CLIP [15]. For training, the batch size is 64. |
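
The experiment-setup quote above describes a standard warmup-plus-cosine-annealing schedule with SGD. The following is a minimal sketch of that configuration, assuming PyTorch; the function and parameter names (`build_optimizer_and_schedule`, `prompt_params`) are illustrative and not taken from the authors' repository.

```python
# Sketch of the reported training schedule: SGD, 150 epochs, 5 warmup epochs
# at lr=1e-4, then cosine annealing from the base learning rate
# (0.1 for textual/visual prompt learning, 0.01 for multimodal prompts).
import math
import torch

def build_optimizer_and_schedule(prompt_params, base_lr=0.1,
                                 warmup_epochs=5, warmup_lr=1e-4,
                                 total_epochs=150):
    # `prompt_params` stands in for the learnable prompt vectors
    # (prefix length 16 for textual/visual, 8 for multimodal); hypothetical name.
    optimizer = torch.optim.SGD(prompt_params, lr=base_lr)

    def lr_lambda(epoch):
        # Constant warmup at warmup_lr, then cosine decay of base_lr to 0.
        if epoch < warmup_epochs:
            return warmup_lr / base_lr
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

With `base_lr=0.1` this corresponds to the textual/visual prompt setting; multimodal prompt learning would use `base_lr=0.01`. The quoted batch size of 64 and the 10 pseudolabeling iterations are orthogonal to this schedule and would be set in the training loop.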