Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
Authors: Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "Following previous works [5, 6], we conduct experiments to evaluate our proposed method with three different settings. These settings encompass the base-to-new setting as well as two out-of-distribution (OOD) settings, i.e., the cross-dataset setting and the cross-domain setting. Refer to Appendix B for a detailed overview of the evaluation protocol." (A hedged sketch of the base-to-new protocol is given after the table.) |
| Researcher Affiliation | Academia | Yanan Zhang1,2, Jiangmeng Li2, Lixiang Liu1,2, Wenwen Qiang2; 1University of Chinese Academy of Sciences, 2Institute of Software, Chinese Academy of Sciences; zhangyanan110199@gmail.com, {jiangmeng2019, lixiang, qiangwenwen}@iscas.ac.cn |
| Pseudocode | Yes | Algorithm 1 The training pipeline of CDC; Algorithm 2 The testing pipeline of CDC |
| Open Source Code | Yes | Furthermore, we provide the code of our proposed method in the supplementary material. |
| Open Datasets | Yes | In the base-to-new setting, we conduct experiments based on 11 datasets: ImageNet [33], Caltech101 [34], Oxford Pets [35], Stanford Cars [36], Flowers102 [37], Food101 [38], FGVC Aircraft [39], SUN397 [40], DTD [9], EuroSAT [41], and UCF-101 [42]. |
| Dataset Splits | No | The paper describes the division of classes into 'base classes' for training and 'new classes' for evaluation in the base-to-new setting, and details the '16-shot setting'. However, it does not explicitly provide details about a separate 'validation' dataset split or its size/purpose. |
| Hardware Specification | Yes | All models are trained using an SGD optimizer on an NVIDIA 3090 GPU. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA version). It mentions using CLIP, a foundational vision-language model, and the MaPLe baseline method, but no detailed software environment information. |
| Experiment Setup | Yes | Specifically, we utilize a pre-trained CLIP with ViT-B/16 as the visual encoder. The number of learnable tokens is fixed at 2, whereas the prompt depth varies, being 9 for the base-to-new setting and 3 for the OOD setting. The learning rate is 0.035, and the batch size is 4. All models are trained using an SGD optimizer on an NVIDIA 3090 GPU. Our proposed CDC introduces three additional hyperparameters: β and γ, which represent the weights for L_de and L_con, respectively, and M, which denotes the number of prompts. We set β = 5, γ = 0.01, and M = 4 in the base-to-new setting, and β = 3, γ = 0.01, and M = 8 in the OOD setting. (A hedged configuration sketch collecting these values follows the table.) |
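The base-to-new evaluation referenced in the Research Type row follows the protocol of prior prompt-learning work: each dataset's classes are split into a base half used for 16-shot training and a held-out new half, with results usually summarized by the harmonic mean of the two accuracies. The sketch below illustrates that convention; the equal split and the `harmonic_mean` helper are assumptions drawn from that literature (e.g., CoCoOp), not details quoted from this paper or its code.

```python
# Minimal sketch of the base-to-new protocol as assumed from prior
# prompt-learning work (e.g., CoOp/CoCoOp); not taken from the authors' code.

def split_base_new(class_names):
    """Split a dataset's class list into a 'base' half (16-shot training)
    and a 'new' half (held out for evaluation)."""
    half = len(class_names) // 2  # assumption: equal split, first half = base
    return class_names[:half], class_names[half:]

def harmonic_mean(acc_base, acc_new):
    """Harmonic mean of base- and new-class accuracy, the usual summary
    metric reported in the base-to-new setting."""
    return 2 * acc_base * acc_new / (acc_base + acc_new)

base_classes, new_classes = split_base_new(["cat", "dog", "car", "plane"])
print(harmonic_mean(0.80, 0.70))  # -> ~0.747
```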
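For reference, the training hyperparameters reported in the Experiment Setup row can be gathered into a single configuration. The dictionary below is a hypothetical sketch: the key names and structure are illustrative and do not reflect the authors' supplementary code; OOD-setting values are noted in comments.

```python
# Hypothetical configuration collecting the hyperparameters reported in the
# paper; key names are illustrative and do not reflect the authors' code.

BASE_TO_NEW_CFG = {
    "backbone": "ViT-B/16",      # pre-trained CLIP visual encoder
    "n_learnable_tokens": 2,
    "prompt_depth": 9,           # 3 in the OOD (cross-dataset / cross-domain) settings
    "optimizer": "SGD",
    "learning_rate": 0.035,
    "batch_size": 4,
    "beta": 5.0,                 # weight of L_de (3 in the OOD settings)
    "gamma": 0.01,               # weight of L_con
    "num_prompts": 4,            # M (8 in the OOD settings)
}
```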