UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
Authors: Xin Li, Sima Behpour, Thang Long Doan, Wenbin He, Liang Gou, Liu Ren
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving up to a performance gain of 20%. |
| Researcher Affiliation | Industry | Bosch Research North America, Bosch Center for Artificial Intelligence (BCAI) {xin.li9, sima.behpour, thang.doan, wenbin.he2, liang.gou, liu.ren}@us.bosch.com |
| Pseudocode | No | The paper refers to 'Algorithm 1' in the text ('The optimization is a one-stage and end-to-end process, shown in Algorithm 1.'), but no pseudocode or algorithm block labeled 'Algorithm 1' is present in the provided text. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | We select seven image classification datasets that are widely used in evaluating the V-L model adaptation approach. These datasets constitute a comprehensive benchmark, covering a diverse set of vision tasks, including the classification of generic objects (Caltech101 [14]), actions (UCF101 [36]), fine-grained categories (Oxford Pets [29], FGVCAircraft [26], and Flowers102 [27]), as well as some specialized tasks such as recognizing texture (DTD [8]) and satellite imagery (EuroSAT [18]). |
| Dataset Splits | Yes | Table 8: Datasets Statistics. The detailed statistics of the 7 datasets and the hand-crafted prompts that are used for BLIP-2 zero-shot learning. |
| Hardware Specification | Yes | We optimize our model with a batch size of 256 for a total of 150 epochs on RTX 3090. |
| Software Dependencies | No | The paper mentions software components like BLIP-2, CLIP, and DINOv2 models, and optimizers such as Adam, but it does not specify any version numbers for these or other software libraries or frameworks (e.g., PyTorch, TensorFlow, Python versions). |
| Experiment Setup | Yes | Training Details For the base model, we use the best available vision backbone in BLIP-2, which is ViT-G. Previous work [48] on prompt learning has shown that a shorter context length can lead to better and more robust performance. Therefore, we initialize the context vectors with a fixed length of 4. The two hyperparameters, τI and τC, are set to 0.5 and 1.0, respectively. Training is performed with the Adam optimizer and a learning rate of 0.0003. We optimize our model with a batch size of 256 for a total of 150 epochs on RTX 3090. (A hedged configuration sketch based on these reported values appears below the table.) |
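Since the paper reports its training hyperparameters but releases no code (see the Open Source Code row), the following is a minimal PyTorch sketch of how the reported configuration might be wired together. All module and variable names (`PromptLearner`, `frozen_vl_model`, `updp_loss`, `tau_i`, `tau_c`) are hypothetical placeholders; only the numeric values (context length 4, τI = 0.5, τC = 1.0, Adam with learning rate 0.0003, batch size 256, 150 epochs) come from the paper.

```python
# Hypothetical training-configuration sketch based only on the hyperparameters
# reported in the paper; UP-DP's actual architecture and objective are not released.
import torch
import torch.nn as nn


class PromptLearner(nn.Module):
    """Hypothetical learnable context vectors (fixed length 4, as reported)."""

    def __init__(self, context_length: int = 4, embed_dim: int = 768):
        super().__init__()
        # Randomly initialized context vectors, as in typical prompt-learning setups.
        self.context = nn.Parameter(torch.randn(context_length, embed_dim) * 0.02)

    def forward(self) -> torch.Tensor:
        return self.context


# Hyperparameters reported in the paper's Training Details.
tau_i, tau_c = 0.5, 1.0   # hyperparameters τI and τC
batch_size = 256
num_epochs = 150
lr = 3e-4                 # Adam learning rate 0.0003

prompt_learner = PromptLearner(context_length=4)
optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=lr)

# Placeholder training loop: `dataloader`, `frozen_vl_model`, and `updp_loss`
# stand in for the (unreleased) BLIP-2 ViT-G backbone and UP-DP objective,
# so the loop is left commented out.
# for epoch in range(num_epochs):
#     for images in dataloader:                      # batches of size 256
#         features = frozen_vl_model(images, prompt_learner())
#         loss = updp_loss(features, tau_i=tau_i, tau_c=tau_c)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```

Only the prompt learner and optimizer are instantiated here; the loop itself is illustrative, since reproducing the full method would require the frozen BLIP-2 ViT-G backbone and the paper's loss, neither of which is publicly available.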