Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
Authors: Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "Following previous works [5, 6], we conduct experiments to evaluate our proposed method with three different settings. These settings encompass the base-to-new setting as well as two out-of-distribution (OOD) settings, i.e., the cross-dataset setting and the cross-domain setting. Refer to Appendix B for a detailed overview of the evaluation protocol." (A hedged sketch of the base-to-new protocol is given after the table.) |
| Researcher Affiliation | Academia | Yanan Zhang1,2, Jiangmeng Li2, Lixiang Liu1,2, Wenwen Qiang2; 1University of Chinese Academy of Sciences, 2Institute of Software, Chinese Academy of Sciences; zhangyanan110199@gmail.com, {jiangmeng2019, lixiang, qiangwenwen}@iscas.ac.cn |
| Pseudocode | Yes | Algorithm 1 The training pipeline of CDC; Algorithm 2 The testing pipeline of CDC |
| Open Source Code | Yes | Furthermore, we provide the code of our proposed method in the supplementary material. |
| Open Datasets | Yes | In the base-to-new setting, we conduct experiments based on 11 datasets: ImageNet [33], Caltech101 [34], Oxford Pets [35], Stanford Cars [36], Flowers102 [37], Food101 [38], FGVC Aircraft [39], SUN397 [40], DTD [9], EuroSAT [41], and UCF-101 [42]. |
| Dataset Splits | No | The paper describes the division of classes into 'base classes' for training and 'new classes' for evaluation in the base-to-new setting, and details the '16-shot setting'. However, it does not explicitly provide details about a separate 'validation' dataset split or its size/purpose. |
| Hardware Specification | Yes | All models are trained using an SGD optimizer on an NVIDIA 3090 GPU. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA version). It mentions using CLIP, a foundational vision-language model, and the MaPLe baseline method, but no detailed software environment information. |
| Experiment Setup | Yes | Specifically, we utilize a pre-trained CLIP with ViT-B/16 as the visual encoder. The number of learnable tokens is fixed at 2, whereas the prompt depth varies, being 9 for the base-to-new setting and 3 for the OOD setting. The learning rate is 0.035, and the batch size is 4. All models are trained using an SGD optimizer on an NVIDIA 3090 GPU. Our proposed CDC introduces three additional hyperparameters: β and γ, which represent the weights for L_de and L_con, respectively, and M, which denotes the number of prompts. We set β = 5, γ = 0.01, and M = 4 in the base-to-new setting, and β = 3, γ = 0.01, and M = 8 in the OOD setting. (A hedged configuration sketch collecting these values follows the table.) |
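The base-to-new evaluation referenced in the Research Type row follows the protocol of prior prompt-learning work: each dataset's classes are split into a base half used for 16-shot training and a held-out new half, with results usually summarized by the harmonic mean of the two accuracies. The sketch below illustrates that convention; the equal split and the `harmonic_mean` helper are assumptions drawn from that literature (e.g., CoCoOp), not details quoted from this paper or its code.

```python
# Minimal sketch of the base-to-new protocol as assumed from prior
# prompt-learning work (e.g., CoOp/CoCoOp); not taken from the authors' code.

def split_base_new(class_names):
    """Split a dataset's class list into a 'base' half (16-shot training)
    and a 'new' half (held out for evaluation)."""
    half = len(class_names) // 2  # assumption: equal split, first half = base
    return class_names[:half], class_names[half:]

def harmonic_mean(acc_base, acc_new):
    """Harmonic mean of base- and new-class accuracy, the usual summary
    metric reported in the base-to-new setting."""
    return 2 * acc_base * acc_new / (acc_base + acc_new)

base_classes, new_classes = split_base_new(["cat", "dog", "car", "plane"])
print(harmonic_mean(0.80, 0.70))  # -> ~0.747
```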
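For reference, the training hyperparameters reported in the Experiment Setup row can be gathered into a single configuration. The dictionary below is a hypothetical sketch: the key names and structure are illustrative and do not reflect the authors' supplementary code; OOD-setting values are noted in comments.

```python
# Hypothetical configuration collecting the hyperparameters reported in the
# paper; key names are illustrative and do not reflect the authors' code.

BASE_TO_NEW_CFG = {
    "backbone": "ViT-B/16",      # pre-trained CLIP visual encoder
    "n_learnable_tokens": 2,
    "prompt_depth": 9,           # 3 in the OOD (cross-dataset / cross-domain) settings
    "optimizer": "SGD",
    "learning_rate": 0.035,
    "batch_size": 4,
    "beta": 5.0,                 # weight of L_de (3 in the OOD settings)
    "gamma": 0.01,               # weight of L_con
    "num_prompts": 4,            # M (8 in the OOD settings)
}
```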