Open-Vocabulary Calibration for Fine-tuned CLIP
Authors: Shuoyuan Wang, Jindong Wang, Guoqing Wang, Bob Zhang, Kaiyang Zhou, Hongxin Wei
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments with 7 distinct prompt learning methods applied across 11 diverse downstream datasets demonstrate the effectiveness of DAC, which achieves high efficacy without sacrificing the inference speed. Our code is available at https://github.com/mlstat-Sustech/CLIP_Calibration. |
| Researcher Affiliation | Academia | 1 Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China. 2 Department of Computer and Information Science, University of Macau, Taipa, Macau. 3 William & Mary, Williamsburg, Virginia, USA. 4 School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. 5 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. Correspondence to: Hongxin Wei <weihx@sustech.edu.cn>. |
| Pseudocode | No | No pseudocode or algorithm blocks were explicitly labeled or formatted as such. |
| Open Source Code | Yes | Our code is available at https://github.com/mlstat-Sustech/CLIP_Calibration. |
| Open Datasets | Yes | Datasets. In this work, we follow the standard base-to-new setting (Zhou et al., 2022b;a) and use 11 image recognition datasets in our experiments. The datasets cover diverse classification tasks, including general object datasets like ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004), alongside fine-grained classification datasets such as OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013). |
| Dataset Splits | Yes | Following the base-to-new generalization setting in CoCoOp (Zhou et al., 2022a), we split the dataset into two subsets, base and new; the labels of these subsets do not overlap. As shown in Table 8, we split the dataset into three folds throughout our experiments. The pre-trained CLIP is first tuned on the base classes. Prior calibration methods such as temperature scaling (Guo et al., 2017) require a calibration set to align the predicted probabilities more closely with the actual likelihoods of the outcomes. Since we cannot access the new (open-vocabulary) classes, we sample a calibration set from the base-class data that does not overlap with the training set (see the split sketch after the table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/16) and refers to the standard codebase for implementation, but does not list specific software dependencies with version numbers (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a). For our proposed DAC, for simplicity, we set the number of nearest textual features K = 5 throughout the paper to calculate TD scores. Following the official implementations, we list the general hyperparameters in Table 9. Here, we briefly introduce the exclusive hyperparameters of each VLM tuning method; all methods are adopted from their official implementations. CoOp and CoCoOp have no additional hyperparameters. For ProDA, we set λ = 0.1. For KgCoOp, we set λ = 8.0. For MaPLe, we set the prompt depth J to 0 and the language and vision prompt lengths to 2. For ProGrad, we set λ = 0.8. For PromptSRC, we use deep prompting with V = T = 4, with λ1 = 10 and λ2 = 25 as loss weights. For textual diversity, we use a total of N = 60 standard prompt templates. (A hedged sketch of the K-nearest textual-feature computation appears after the table.) |
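
The Dataset Splits row describes a base/new class split plus a base-class calibration set that is disjoint from the few-shot training set. Below is a minimal sketch of that sampling procedure, assuming a 50/50 alphabetical split of the classes and 4 calibration images per base class; it is not the authors' code, and all names and sizes beyond the 16-shot setting are illustrative.

```python
# Minimal sketch (not the authors' code) of the base-to-new split and the
# base-class calibration set described above. The 50/50 alphabetical class split,
# the 4-images-per-class calibration size, and all names are illustrative assumptions.
import random
from collections import defaultdict

def base_new_split(samples, num_shots=16, calib_per_class=4, seed=0):
    """samples: list of (image_path, class_name) pairs for one dataset."""
    rng = random.Random(seed)
    classes = sorted({cls for _, cls in samples})
    half = len(classes) // 2
    base_classes = set(classes[:half])   # model is tuned and calibrated on these
    # the remaining classes serve as the "new" (open-vocabulary) evaluation classes

    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)

    train, calib, new_eval = [], [], []
    for cls, paths in by_class.items():
        rng.shuffle(paths)
        if cls in base_classes:
            # 16-shot training set and a disjoint calibration set, both from base classes
            train += [(p, cls) for p in paths[:num_shots]]
            calib += [(p, cls) for p in paths[num_shots:num_shots + calib_per_class]]
        else:
            # new classes are held out entirely for open-vocabulary evaluation
            new_eval += [(p, cls) for p in paths]
    return train, calib, new_eval
```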
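
The Experiment Setup row mentions computing TD scores from the K = 5 nearest textual features. The exact DAC formula is not quoted above, so the sketch below assumes the TD score is the mean cosine distance from a class's text embedding to its K nearest base-class text embeddings; treat it as an illustration of the K-nearest-neighbour step, not the paper's implementation, and the function and argument names as assumptions.

```python
# Hedged sketch: assumed TD score = mean cosine distance from each class's text
# embedding to its K = 5 nearest base-class text embeddings. The exact DAC
# definition is not quoted in the table; the distance choice and names are assumptions.
import torch
import torch.nn.functional as F

def td_scores(class_text_feats: torch.Tensor, base_text_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """class_text_feats: (N, D) CLIP text embeddings to score; base_text_feats: (M, D) base-class embeddings."""
    q = F.normalize(class_text_feats, dim=-1)
    b = F.normalize(base_text_feats, dim=-1)
    cos_sim = q @ b.t()                    # (N, M) cosine similarities
    topk_sim, _ = cos_sim.topk(k, dim=-1)  # K most similar base textual features per class
    return (1.0 - topk_sim).mean(dim=-1)   # mean cosine distance to the K neighbours
```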