Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Authors: Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, James Y. Zou

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. The main objective of our paper is to (i) empirically demonstrate the modality gap phenomenon across different data modalities and NN architectures; (ii) explain how the gap arises; and (iii) show that the size of the gap can affect downstream applications.
Researcher Affiliation | Academia | Weixin Liang (Stanford University, wxliang@stanford.edu), Yuhui Zhang (Stanford University, yuhuiz@stanford.edu), Yongchan Kwon (Columbia University, yk3012@columbia.edu), Serena Yeung (Stanford University, syyeung@stanford.edu), James Zou (Stanford University, jamesz@stanford.edu)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://modalitygap.readthedocs.io/. We provide an open-source implementation of our work at https://github.com/Weixin-Liang/Modality-Gap.
Open Datasets | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models, respectively (ResNet, Vision Transformer, Text Transformer), on MSCOCO Caption [8]. We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature).
Dataset Splits | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models, respectively (ResNet, Vision Transformer, Text Transformer), on MSCOCO Caption [8]. Design: To test this hypothesis, we design a loss landscape probing experiment on n = 5,000 image-caption pairs from the validation set of the MSCOCO Caption dataset.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types).
Software Dependencies | No | The paper mentions models like ResNet, Vision Transformer, Text Transformer, and CLIP, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature). To further investigate the impact of temperature on the modality gap, we fine-tune CLIP under 6 different temperatures τ ∈ {1/100, 1/50, 1/20, 1/10, 1}, respectively, on the MSCOCO Caption training set with batch size 64. The final learned temperature in CLIP is τ = 1/100.
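
The sketches below expand on the technical details quoted in the table. The Research Type row centers on the modality gap, which the paper characterizes as the offset between the centers of the L2-normalized image and text embeddings. Here is a minimal sketch of that measurement, assuming the paired embeddings are already extracted; the function name and array layout are illustrative, not taken from the authors' code:

```python
import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean norm of the difference between the centroids of
    L2-normalized image and text embeddings (rows = examples)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    gap_vector = img.mean(axis=0) - txt.mean(axis=0)
    return float(np.linalg.norm(gap_vector))
```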
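
The Open Datasets and Dataset Splits rows quote the extraction of 5,000 paired embeddings on MSCOCO Caption. Since the paper names no software stack (see the Software Dependencies row), the following sketch uses the Hugging Face transformers CLIP API only as one plausible way to obtain paired, normalized image/text features; the checkpoint name and the helper function are assumptions, not the authors' pipeline:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def embed_pairs(images, captions):
    """Return L2-normalized embeddings for a batch of PIL images and
    their captions (strings), one caption per image."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return img, txt
```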
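
The temperatures quoted in the Experiment Setup row enter through the standard CLIP-style symmetric contrastive loss. Below is a generic PyTorch sketch of that objective with the temperature exposed as an argument; it is the textbook InfoNCE form rather than the authors' training code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 1 / 100) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image cross-entropy over a batch
    of paired, L2-normalized embeddings; tau = 1/100 matches the quote."""
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Sweeping `temperature` over the quoted set {1/100, 1/50, 1/20, 1/10, 1} only changes how sharply the softmax concentrates on the matched pair, which is what the fine-tuning experiment varies.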
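
Finally, the loss-landscape probe mentioned in the Dataset Splits row amounts to shifting embeddings along the gap direction and re-evaluating the contrastive loss on the held-out pairs. The sketch below is a hedged reconstruction of that idea; the shift schedule, the choice to move only the image embeddings, and the renormalization step are assumptions rather than the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def probe_gap_landscape(image_emb, text_emb, shifts, temperature=1 / 100):
    """Contrastive loss as image embeddings are translated along the
    image-to-text gap vector (lam = 0 keeps the gap, lam = 1 closes it)."""
    gap = image_emb.mean(dim=0) - text_emb.mean(dim=0)
    losses = []
    for lam in shifts:
        img = F.normalize(image_emb - lam * gap, dim=-1)
        logits = img @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        losses.append(loss.item())
    return losses
```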