Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
Authors: Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, James Y. Zou
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. The main objective of our paper is to i) empirically demonstrate the modality gap phenomenon across different data modalities and NN architectures; ii) explain how the gap arises; and iii) show that the size of the gap can affect downstream applications. |
| Researcher Affiliation | Academia | Weixin Liang (Stanford University, wxliang@stanford.edu); Yuhui Zhang (Stanford University, yuhuiz@stanford.edu); Yongchan Kwon (Columbia University, yk3012@columbia.edu); Serena Yeung (Stanford University, syyeung@stanford.edu); James Zou (Stanford University, jamesz@stanford.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://modalitygap.readthedocs.io/. We provide an open-source implementation of our work at https://github.com/Weixin-Liang/Modality-Gap. |
| Open Datasets | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models respectively (ResNet, Vision Transformer, Text Transformer) on MSCOCO Caption [8]. We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature). (See the embedding-extraction sketch after the table.) |
| Dataset Splits | Yes | We extract 5,000 embeddings from the final layer of 3 pre-trained models respectively (ResNet, Vision Transformer, Text Transformer) on MSCOCO Caption [8]. Design: To test this hypothesis, we design a loss landscape probing experiment on n = 5,000 image-caption pairs from the validation set of the MSCOCO Caption dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types). |
| Software Dependencies | No | The paper mentions models like ResNet, Vision Transformer, Text Transformer, and CLIP, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train both models on the MSCOCO Caption training set with batch size 64 and temperature τ = 1/100 (i.e., CLIP's learned temperature). To further investigate the impact of temperature on the modality gap, we fine-tune CLIP under 6 different temperatures τ ∈ {1/100, 1/50, 1/20, 1/10, 1} on the MSCOCO Caption training set with batch size 64. The final learned temperature in CLIP is τ = 1/100. (See the contrastive-loss sketch after the table.) |
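
The Open Datasets and Dataset Splits rows describe extracting 5,000 paired image and text embeddings from a pre-trained CLIP model on MSCOCO Caption. The following is a minimal sketch of that kind of extraction, not the authors' released code: the `ViT-B/32` backbone, the file paths, and the `pairs` iterable of (image path, caption) tuples are assumptions for illustration.

```python
# Minimal sketch: extract paired, L2-normalized image/text embeddings from a
# pre-trained CLIP model for MSCOCO Caption validation pairs.
# The "pairs" iterable and the ViT-B/32 backbone are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed backbone

def embed_pairs(pairs, n_max=5000):
    """pairs: iterable of (image_path, caption) tuples, e.g. from MSCOCO Caption val."""
    img_embs, txt_embs = [], []
    with torch.no_grad():
        for i, (image_path, caption) in enumerate(pairs):
            if i >= n_max:
                break
            image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
            text = clip.tokenize([caption]).to(device)
            img = model.encode_image(image)
            txt = model.encode_text(text)
            # L2-normalize so embeddings lie on the unit hypersphere,
            # which is where the modality gap is observed.
            img_embs.append(img / img.norm(dim=-1, keepdim=True))
            txt_embs.append(txt / txt.norm(dim=-1, keepdim=True))
    return torch.cat(img_embs), torch.cat(txt_embs)

# The gap can then be summarized as the distance between modality centroids:
# gap = (img_embs.mean(0) - txt_embs.mean(0)).norm()
```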
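The Experiment Setup row quotes training with batch size 64 and a temperature of 1/100, i.e., a CLIP-style symmetric contrastive objective. Below is a minimal sketch of that loss under those stated hyperparameters; the encoders are left abstract, and the function name and fixed (non-learnable) temperature are assumptions, not the authors' implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss with a fixed
# temperature (1/100, as quoted in the Experiment Setup row).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=1 / 100):
    """image_emb, text_emb: (batch, dim) embeddings of matched image-caption pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine-similarity logits scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with a batch of 64 paired embeddings:
# loss = clip_contrastive_loss(img_batch, txt_batch, temperature=1 / 100)
```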