Open-Vocabulary Calibration for Fine-tuned CLIP
Authors: Shuoyuan Wang, Jindong Wang, Guoqing Wang, Bob Zhang, Kaiyang Zhou, Hongxin Wei
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments with 7 distinct prompt learning methods applied across 11 diverse downstream datasets demonstrate the effectiveness of DAC, which achieves high efficacy without sacrificing the inference speed. Our code is available at https://github.com/mlstat-Sustech/CLIP_Calibration. |
| Researcher Affiliation | Academia | 1 Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China. 2 Department of Computer and Information Science, University of Macau, Taipa, Macau. 3 William & Mary, Williamsburg, Virginia, USA. 4 School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. 5 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China. Correspondence to: Hongxin Wei <weihx@sustech.edu.cn>. |
| Pseudocode | No | No pseudocode or algorithm blocks were explicitly labeled or formatted as such. |
| Open Source Code | Yes | Our code is available at https://github.com/mlstat-Sustech/CLIP_Calibration. |
| Open Datasets | Yes | Datasets. In this work, we follow the standard base-to-new setting (Zhou et al., 2022b;a) and use 11 image recognition datasets in our experiments. The datasets cover diverse classification tasks, including general object datasets like ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004), alongside fine-grained classification datasets such as OxfordPets (Parkhi et al., 2012), StanfordCars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013). |
| Dataset Splits | Yes | Following the base-to-new generalization setting in CoCoOp (Zhou et al., 2022a), we split the dataset into two subsets, base and new; the labels of these subsets do not overlap. As shown in Table 8, we split the dataset into three folds throughout our experiments. The pre-trained CLIP is first tuned on the base classes. Prior calibration methods such as temperature scaling (Guo et al., 2017) require a calibration set to align the predicted probabilities more closely with the actual likelihoods of the outcomes. Since we cannot access the new (open-vocabulary) classes, we sample a calibration set from the base-class data that does not overlap with the training set (see the split sketch after the table). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/16) and refers to the standard codebase for implementation, but does not list specific software dependencies with version numbers (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We fine-tune the model with 16 samples per class in a few-shot setting (Zhou et al., 2022a). For our proposed DAC, for simplicity, we set the number of nearest textual features K = 5 throughout the paper to calculate TD scores. Following the official implementations, we list the general hyperparameters in Table 9. Here, we briefly introduce the exclusive hyperparameters of each VLM tuning method; all methods are adopted from their official implementations. CoOp and CoCoOp have no additional hyperparameters. For ProDA, we set λ = 0.1. For KgCoOp, we set λ = 8.0. For MaPLe, we set the prompt depth J to 0 and the language and vision prompt lengths to 2. For ProGrad, we set λ = 0.8. For PromptSRC, we use deep prompting with V = T = 4, with λ1 = 10 and λ2 = 25 as loss weights. For textual diversity, we use a total of N = 60 standard prompt templates. (A hedged sketch of the K-nearest textual-feature computation appears after the table.) |
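
The Dataset Splits row describes a base/new class split plus a base-class calibration set that is disjoint from the few-shot training set. Below is a minimal sketch of that sampling procedure, assuming a 50/50 alphabetical split of the classes and 4 calibration images per base class; it is not the authors' code, and all names and sizes beyond the 16-shot setting are illustrative.

```python
# Minimal sketch (not the authors' code) of the base-to-new split and the
# base-class calibration set described above. The 50/50 alphabetical class split,
# the 4-images-per-class calibration size, and all names are illustrative assumptions.
import random
from collections import defaultdict

def base_new_split(samples, num_shots=16, calib_per_class=4, seed=0):
    """samples: list of (image_path, class_name) pairs for one dataset."""
    rng = random.Random(seed)
    classes = sorted({cls for _, cls in samples})
    half = len(classes) // 2
    base_classes = set(classes[:half])   # model is tuned and calibrated on these
    # the remaining classes serve as the "new" (open-vocabulary) evaluation classes

    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)

    train, calib, new_eval = [], [], []
    for cls, paths in by_class.items():
        rng.shuffle(paths)
        if cls in base_classes:
            # 16-shot training set and a disjoint calibration set, both from base classes
            train += [(p, cls) for p in paths[:num_shots]]
            calib += [(p, cls) for p in paths[num_shots:num_shots + calib_per_class]]
        else:
            # new classes are held out entirely for open-vocabulary evaluation
            new_eval += [(p, cls) for p in paths]
    return train, calib, new_eval
```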
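
The Experiment Setup row mentions computing TD scores from the K = 5 nearest textual features. The exact DAC formula is not quoted above, so the sketch below assumes the TD score is the mean cosine distance from a class's text embedding to its K nearest base-class text embeddings; treat it as an illustration of the K-nearest-neighbour step, not the paper's implementation, and the function and argument names as assumptions.

```python
# Hedged sketch: assumed TD score = mean cosine distance from each class's text
# embedding to its K = 5 nearest base-class text embeddings. The exact DAC
# definition is not quoted in the table; the distance choice and names are assumptions.
import torch
import torch.nn.functional as F

def td_scores(class_text_feats: torch.Tensor, base_text_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """class_text_feats: (N, D) CLIP text embeddings to score; base_text_feats: (M, D) base-class embeddings."""
    q = F.normalize(class_text_feats, dim=-1)
    b = F.normalize(base_text_feats, dim=-1)
    cos_sim = q @ b.t()                    # (N, M) cosine similarities
    topk_sim, _ = cos_sim.topk(k, dim=-1)  # K most similar base textual features per class
    return (1.0 - topk_sim).mean(dim=-1)   # mean cosine distance to the K neighbours
```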