Revisiting the Role of Language Priors in Vision-Language Models

Authors: Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore their zero-shot performance on the illustrative task of image-text retrieval across nine popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. (A minimal scoring sketch is given after the table.)
Researcher Affiliation | Collaboration | *Equal contribution. 1CMU, 2Meta. Correspondence to: Zhiqiu Lin <zhiqiul@andrew.cmu.edu>.
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We run BLIP-ITC and BLIP-ITM using our own codebase, which will be released to the public.
Open Datasets | Yes | We leverage OTS image-conditioned language models to estimate P_train(t). Most of our diagnostic experiments focus on the open-sourced BLIP (Li et al., 2022; 2023) model, trained on public image-text corpora using discriminative (ITC and ITM) and generative (captioning) objectives. We comprehensively report on (a) four recent I-to-T retrieval benchmarks that assess compositionality, including ARO (Yuksekgonul et al., 2022), CREPE (Ma et al., 2022), SugarCrepe (Hsieh et al., 2023), and VL-CheckList (Zhao et al., 2022); (b) COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014) for large-scale retrieval; (c) ImageNet (Deng et al., 2009) for zero-shot image classification.
Dataset Splits | Yes | One can also do slightly better by using a held-out valset to tune for the optimal α ∈ [0, 1]. For Winoground and EqBen, we sample half of the data as a valset and perform a grid search for α_val (using a step size of 0.001), reporting the performance on the other half. We repeat this process 10 times and report the mean and standard deviation. For COCO and Flickr30K, we perform α-debiasing using Recall@1 (R@1) on the official valset. (See the α grid-search sketch after the table.)
Hardware Specification | Yes | Computational resources. All experiments use a single NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions using models like BLIP (Li et al., 2022; 2023) and LLMs like FLAN-T5 (Chung et al., 2022) or OPT (Zhang et al., 2022), but it does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | BLIP and BLIP-2 experiments sample Gaussian noise images with a mean of 1.0 and a standard deviation of 0.25. By default, we use 100 images for Winoground, 30 images for EqBen, 1 image for ImageNet, and 3 images for the rest of the benchmarks. For Winoground and EqBen, we sample half of the data as a valset and perform a grid search for α_val (using a step size of 0.001), reporting the performance on the other half. (See the noise-image prior sketch after the table.)
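
The scoring rule quoted in the Research Type row can be made concrete with a short sketch. This is not the authors' released code: it assumes the Hugging Face transformers BLIP captioning checkpoint "Salesforce/blip-image-captioning-base", and the per-token averaging of the log-likelihood is an illustrative choice.

```python
# Minimal sketch (not the authors' released code): repurposing a generative VLM
# for image-to-text retrieval by scoring each candidate caption with its
# log-likelihood given the image. The checkpoint name is an assumption.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

@torch.no_grad()
def caption_log_prob(image: Image.Image, caption: str) -> float:
    """Average per-token log P(caption | image) under the captioning head."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    # With labels supplied, the model returns the caption's token-level
    # cross-entropy; its negation serves as a length-normalized log-likelihood.
    out = model(**inputs, labels=inputs.input_ids)
    return -out.loss.item()

def retrieve_caption(image: Image.Image, candidate_texts: list) -> str:
    """Image-to-text retrieval: return the caption the generative model scores highest."""
    return max(candidate_texts, key=lambda t: caption_log_prob(image, t))
```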
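
The Experiment Setup row describes estimating the language prior P_train(t) from content-free Gaussian noise images (mean 1.0, standard deviation 0.25). The sketch below assumes a scorer with the interface of caption_log_prob above (adapted to accept a tensor image), a placeholder image shape, and averaging in probability space via log-mean-exp; none of these details are taken from the paper beyond the noise statistics.

```python
# Sketch under stated assumptions: estimating log P_train(t) by averaging
# P(t | v) over content-free Gaussian-noise pseudo-images v.
import math
import torch

def sample_noise_images(num_images: int, shape=(3, 384, 384),
                        mean: float = 1.0, std: float = 0.25) -> torch.Tensor:
    """Content-free Gaussian-noise pseudo-images (shape is a placeholder)."""
    return torch.randn(num_images, *shape) * std + mean

def estimate_log_prior(text: str, noise_images: torch.Tensor,
                       log_p_text_given_image) -> float:
    """log P_train(t) ~= log of the mean of P(t | v_k) over noise images v_k.

    `log_p_text_given_image(image, text)` is any scorer returning a log-probability,
    e.g. an adaptation of `caption_log_prob` above that accepts a tensor image.
    """
    log_probs = torch.tensor([log_p_text_given_image(v, text) for v in noise_images])
    return (torch.logsumexp(log_probs, dim=0) - math.log(len(noise_images))).item()
```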
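
The Dataset Splits row describes tuning a debiasing weight α on a held-out valset. Below is a minimal sketch of that grid search, assuming precomputed score matrices: the debiased image-to-text score is log P(t|i) - α * log P_train(t), and α is swept over [0, 1] with a 0.001 step to maximize validation retrieval accuracy. The interface (dense score matrices, top-1 accuracy) is an assumption for illustration; the paper's full protocol (10 random val/test splits for Winoground and EqBen, Recall@1 on official valsets for COCO and Flickr30K) is only summarized by the quote above.

```python
# Sketch under stated assumptions: grid search for the debiasing weight alpha
# on a validation split, using precomputed log P(t|i) and log P_train(t).
import torch

def debiased_scores(log_p_t_given_i: torch.Tensor, log_prior: torch.Tensor,
                    alpha: float) -> torch.Tensor:
    """Debiased scores: log P(t|i) - alpha * log P_train(t).

    log_p_t_given_i: [num_images, num_texts]; log_prior: [num_texts].
    """
    return log_p_t_given_i - alpha * log_prior

def top1_accuracy(scores: torch.Tensor, gt_text_idx: torch.Tensor) -> float:
    """Top-1 image-to-text retrieval accuracy."""
    return (scores.argmax(dim=1) == gt_text_idx).float().mean().item()

def tune_alpha(val_log_p: torch.Tensor, val_log_prior: torch.Tensor,
               val_gt: torch.Tensor, step: float = 0.001) -> float:
    """Pick alpha in [0, 1] (grid step 0.001) that maximizes validation accuracy."""
    alphas = torch.arange(0.0, 1.0 + step, step)
    accs = torch.tensor([top1_accuracy(debiased_scores(val_log_p, val_log_prior, a.item()), val_gt)
                         for a in alphas])
    return alphas[accs.argmax()].item()
```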