Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
Authors: Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, James Y. Zou
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. The main objective of our paper is to i) empirically demonstrate the modality gap phenomenon across different data modalities and NN architectures; ii) explain how the gap arises; and iii) show that the size of the gap can affect downstream applications. |
| Researcher Affiliation | Academia | Weixin Liang (Stanford University, wxliang@stanford.edu); Yuhui Zhang (Stanford University, yuhuiz@stanford.edu); Yongchan Kwon (Columbia University, yk3012@columbia.edu); Serena Yeung (Stanford University, syyeung@stanford.edu); James Zou (Stanford University, jamesz@stanford.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://modalitygap.readthedocs.io/. We provide an open-source implementation of our work at https://github.com/Weixin-Liang/Modality-Gap. |
| Open Datasets | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models respectively (ResNet, Vision Transformer, Text Transformer) on MSCOCO Caption [8]. We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature). (See the embedding-extraction sketch after the table.) |
| Dataset Splits | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models respectively (ResNet, Vision Transformer, Text Transformer) on MSCOCO Caption [8]. Design: To test this hypothesis, we design a loss landscape probing experiment on n = 5,000 image-caption pairs from the validation set of the MSCOCO Caption dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types). |
| Software Dependencies | No | The paper mentions models like ResNet, Vision Transformer, Text Transformer, and CLIP, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature). To further investigate the impact of temperature on the modality gap, we fine-tune CLIP under 6 different temperatures τ ∈ {1/100, 1/50, 1/20, 1/10, 1} on the MSCOCO Caption training set with batch size 64. The final learned temperature in CLIP is τ = 1/100. (See the contrastive-loss sketch after the table.) |
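
The Open Datasets and Dataset Splits rows describe extracting 5,000 paired image and text embeddings from a pre-trained CLIP model on MSCOCO Caption. The following is a minimal sketch of that kind of extraction, not the authors' released code: the `ViT-B/32` backbone, the file paths, and the `pairs` iterable of (image path, caption) tuples are assumptions for illustration.

```python
# Minimal sketch: extract paired, L2-normalized image/text embeddings from a
# pre-trained CLIP model for MSCOCO Caption validation pairs.
# The "pairs" iterable and the ViT-B/32 backbone are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed backbone

def embed_pairs(pairs, n_max=5000):
    """pairs: iterable of (image_path, caption) tuples, e.g. from MSCOCO Caption val."""
    img_embs, txt_embs = [], []
    with torch.no_grad():
        for i, (image_path, caption) in enumerate(pairs):
            if i >= n_max:
                break
            image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
            text = clip.tokenize([caption]).to(device)
            img = model.encode_image(image)
            txt = model.encode_text(text)
            # L2-normalize so embeddings lie on the unit hypersphere,
            # which is where the modality gap is observed.
            img_embs.append(img / img.norm(dim=-1, keepdim=True))
            txt_embs.append(txt / txt.norm(dim=-1, keepdim=True))
    return torch.cat(img_embs), torch.cat(txt_embs)

# The gap can then be summarized as the distance between modality centroids:
# gap = (img_embs.mean(0) - txt_embs.mean(0)).norm()
```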
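The Experiment Setup row quotes training with batch size 64 and a temperature of 1/100, i.e., a CLIP-style symmetric contrastive objective. Below is a minimal sketch of that loss under those stated hyperparameters; the encoders are left abstract, and the function name and fixed (non-learnable) temperature are assumptions, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss with a fixed
# temperature (1/100, as quoted in the Experiment Setup row).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=1 / 100):
    """image_emb, text_emb: (batch, dim) embeddings of matched image-caption pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine-similarity logits scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with a batch of 64 paired embeddings:
# loss = clip_contrastive_loss(img_batch, txt_batch, temperature=1 / 100)
```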