UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models

Authors: Xin Li, Sima Behpour, Thang Long Doan, Wenbin He, Liang Gou, Liu Ren

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving up to a performance gain of 20%.
Researcher Affiliation | Industry | Bosch Research North America, Bosch Center for Artificial Intelligence (BCAI) {xin.li9, sima.behpour, thang.doan, wenbin.he2, liang.gou, liu.ren}@us.bosch.com
Pseudocode | No | The paper mentions 'Algorithm 1' in the text ('The optimization is a one-stage and end-to-end process, shown in Algorithm 1.'), but no pseudocode or algorithm block labeled 'Algorithm 1' is present in the provided text.
Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository.
Open Datasets | Yes | We select seven image classification datasets that are widely used in evaluating the V-L model adaptation approach. These datasets constitute a comprehensive benchmark, covering a diverse set of vision tasks, including the classification of generic objects (Caltech101 [14]), actions (UCF101 [36]), fine-grained categories (Oxford Pets [29], FGVCAircraft [26], and Flowers102 [27]), as well as some specialized tasks such as recognizing texture (DTD [8]) and satellite imagery (EuroSAT [18]).
Dataset Splits | Yes | Table 8: Datasets Statistics. The detailed statistics of the 7 datasets and the hand-crafted prompts that are used for BLIP-2 zero-shot learning.
Hardware Specification | Yes | We optimize our model with a batch size of 256 for a total of 150 epochs on RTX 3090.
Software Dependencies | No | The paper mentions software components such as the BLIP-2, CLIP, and DINOv2 models and the Adam optimizer, but it does not specify version numbers for these or for other software libraries or frameworks (e.g., PyTorch, TensorFlow, or Python).
Experiment Setup | Yes | Training Details: For the base model, we use the best available vision backbone in BLIP-2, which is ViT-G. Previous work [48] on prompt learning has shown that a shorter context length can lead to better and more robust performance. Therefore, we initialize the context vectors with a fixed length of 4. The two hyperparameters, τI and τC, are set to 0.5 and 1.0, respectively. Training is performed with the Adam optimizer and a learning rate of 0.0003. We optimize our model with a batch size of 256 for a total of 150 epochs on RTX 3090.
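The Experiment Setup row above fully specifies the reported hyperparameters. A minimal sketch of that configuration is given below; it is not the authors' released code, and the variable names, the context embedding dimension (768), and the initialization scheme are illustrative assumptions only the reported values (context length 4, τI = 0.5, τC = 1.0, Adam with lr 0.0003, batch size 256, 150 epochs) come from the paper.

```python
import torch

# Hyperparameters as reported in the paper's training details.
config = {
    "vision_backbone": "ViT-G",  # best available vision backbone in BLIP-2
    "n_ctx": 4,                  # fixed length of the learnable context vectors
    "tau_i": 0.5,                # hyperparameter τI
    "tau_c": 1.0,                # hyperparameter τC
    "lr": 3e-4,                  # Adam learning rate
    "batch_size": 256,
    "epochs": 150,
}

# Learnable prompt context vectors; the embedding dimension is an assumed placeholder.
ctx_dim = 768
ctx = torch.nn.Parameter(torch.empty(config["n_ctx"], ctx_dim))
torch.nn.init.normal_(ctx, std=0.02)

# Only the context vectors are optimized in this sketch.
optimizer = torch.optim.Adam([ctx], lr=config["lr"])
```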