Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
Authors: Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT. Experimental Setup: To evaluate our method, we follow the experiment setup established in previous works such as CoOp (Zhou et al. 2022), CoCoOp (Zhou et al. 2022), and MaPLe (Khattak et al. 2023). |
| Researcher Affiliation | Collaboration | Yubin Wang (Tongji University), Xinyang Jiang (Microsoft Research Asia), De Cheng (Xidian University), Dongsheng Li (Microsoft Research Asia), Cairong Zhao (Tongji University) |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT. |
| Open Datasets | Yes | Datasets: For base-to-new generalization and cross-dataset evaluation, we follow the prior work (Zhou et al. 2022) and evaluate the performance of our method on 11 image recognition datasets, which cover a wide range of recognition tasks. Specifically, the benchmark includes ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004) for classification on generic objects; OxfordPets (Parkhi et al. 2012), StanfordCars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014) and FGVCAircraft (Maji et al. 2013) for fine-grained classification; SUN397 (Xiao et al. 2010) for scene recognition; UCF101 (Soomro, Zamir, and Shah 2012) for action recognition; DTD (Cimpoi et al. 2014) for texture classification; and finally EuroSAT (Helber et al. 2019) for satellite imagery recognition. |
| Dataset Splits | Yes | To evaluate our method, we follow the experiment setup established in previous works such as CoOp (Zhou et al. 2022), CoCoOp (Zhou et al. 2022), and MaPLe (Khattak et al. 2023). We select 16 shots for training and the entire test set for evaluation. (A hedged sketch of this 16-shot split follows the table.) |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models or specific computational resources used for experiments. |
| Software Dependencies | No | The paper mentions using pre-trained CLIP and ChatGPT models, but it does not list specific software dependencies with their version numbers required for reproduction. |
| Experiment Setup | Yes | We utilize SGD optimization with an initial learning rate of 0.0025 for base-to-new generalization and 0.001 for other tasks. Following the prior work (Zhao et al. 2022), the cross-entropy loss is adopted to equally minimize the discrepancy between the ground-truth label and the three aforementioned distributions p_i, p_t, and p_o, while the overall distribution p_o is used for inference. We randomly pick one description for each category to conduct relationship-guided attention learning during training to save memory, while leveraging all N_h descriptions per category for inference. For base-to-new generalization, the maximum epoch is set to 10, with a batch size of 8. The length of global-level prompts N_g is set to 2, and the number of descriptions for each category N_h, which is also the length of high-level prompts, is set to 5. In accordance with the prior work (Zhou et al. 2022), we select 16 shots for training and the entire test set for evaluation. For domain generalization and cross-dataset evaluation, the maximum epoch is set to 3, with a batch size of 8, where we use the same hyperparameters for each dataset instead of a separate search. (A hedged PyTorch sketch of this training recipe also follows the table.) |
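
The dataset-splits row states that training uses 16 shots per class while the entire test set is used for evaluation. The snippet below is a minimal sketch of such a per-class sampler; `sample_few_shot` and the `(image_path, label)` pair format are assumptions for illustration, not the data pipeline of the released HPT code.

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, num_shots=16, seed=0):
    """Keep at most `num_shots` examples per class from (image_path, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_path, label in dataset:
        by_class[label].append((image_path, label))
    subset = []
    for label, items in by_class.items():
        rng.shuffle(items)
        subset.extend(items[:num_shots])
    return subset

# Training draws 16 shots per class; the full test set is kept for evaluation.
# train_split = sample_few_shot(train_items, num_shots=16)
# test_split = test_items
```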
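
The experiment-setup row combines SGD optimization with a cross-entropy loss applied equally to three prediction distributions, with only the overall distribution used at inference. The sketch below illustrates that recipe in PyTorch under explicit assumptions: `model`, the logits triple (`logits_i`, `logits_t`, `logits_o`), `training_step`, and `predict` are hypothetical names for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, labels, optimizer):
    # Assumed interface: the model returns three sets of class logits,
    # one per prediction distribution described in the paper.
    logits_i, logits_t, logits_o = model(images)
    # Equal-weight cross-entropy against the ground-truth labels.
    loss = (F.cross_entropy(logits_i, labels)
            + F.cross_entropy(logits_t, labels)
            + F.cross_entropy(logits_o, labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(model, images):
    # Only the overall distribution is used for inference.
    _, _, logits_o = model(images)
    return logits_o.argmax(dim=-1)

# SGD with the quoted learning rates: 0.0025 for base-to-new generalization,
# 0.001 for the other tasks.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.0025)
```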