Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Authors: Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on real data to confirm our theoretical predictions. Furthermore, inspired by our theoretical findings, we propose a new regularization technique for CLIP that leads to improved zero-shot performance. Empirical results confirm that the proposed regularization improves zero-shot performance across various tasks. (A hedged sketch of where such a regularizer enters the CLIP loss follows the table.)
Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles; Machine Learning Department, Carnegie Mellon University, Pittsburgh. {chenzx19, yihedeng}@cs.ucla.edu; yuanzhil@andrew.cmu.edu; qgu@cs.ucla.edu
Pseudocode | No | The paper presents mathematical formulations of its loss functions and states its theorems, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | Yes | Our code is provided anonymously on GitHub: https://anonymous.4open.science/r/CLIP_theory-BC8F/README.md
Open Datasets | Yes | Datasets. For performance evaluation, we primarily focus on Conceptual Captions 3M (CC3M) (Sharma et al., 2018) as the pretraining dataset, in alignment with prior literature (Li et al., 2022; Goel et al., 2022). Additionally, we use MSCOCO (Chen et al., 2015) in order to conduct lightweight real-data experiments to validate our theoretical findings.
Dataset Splits | Yes | Specifically, the dataset contains 82,783 images, each coupled with 5 captions. We consider each image-caption pair as a data example in pre-training and therefore arrive at 413,915 pre-training data pairs. We further randomly split the data, keeping 20% as a validation set, and stop training when the contrastive loss on validation data no longer decreases, to avoid overfitting on the small dataset. (A sketch of this split and stopping rule follows the table.)
Hardware Specification | Yes | Lastly, our experiments can feasibly be run on a single GeForce RTX 2080 GPU.
Software Dependencies | No | The paper mentions using the "PyTorch Image Models library (Wightman, 2019)" and the "Huggingface Transformers library (Wolf et al., 2020)". While specific libraries are named with citations, explicit version numbers for these libraries or for the underlying Python/PyTorch environment are not provided.
Experiment Setup | Yes | We follow the code framework in Shariatnia (2021) and use a pre-trained ResNet-50 from the PyTorch Image Models library (Wightman, 2019) and a pre-trained DistilBERT from the Huggingface Transformers library (Wolf et al., 2020). We further add linear projection layers on both the image and text encoders, the same as in CLIP, and use an embedding dimension of 512. As we train on small-scale data with pre-trained encoders, we follow Shariatnia (2021) and use the AdamW optimizer with learning rate 1e-4 on the image encoder, 1e-5 on the text encoder, and 1e-3 on the projection layers, with weight decay coefficient 1e-3. [...] For our regularization term, we use a coefficient of λ = 0.1. As in CLIP, we set the temperature τ to 0.07, equivalent to a maximum logit scale of 2.6593. Lastly, we use a training batch size of 32 and train for 8 epochs for the results reported in Section 7.2. (A sketch of this setup follows the table.)
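
The Research Type row above describes the proposed regularization only at a high level, so the following is a minimal sketch of a CLIP-style symmetric contrastive loss with a generic λ-weighted regularization hook. The specific `reg_term` used here (negative mean similarity of matched pairs) is an illustrative placeholder, not the paper's actual regularizer.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_reg(image_emb, text_emb, temperature=0.07, lam=0.1):
    """Symmetric CLIP contrastive loss plus a placeholder regularization term.

    image_emb, text_emb: (batch, dim) embeddings of paired images and captions.
    The regularizer below is only an illustration of where a lambda-weighted
    term would enter the objective; it is not the paper's exact formulation.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Standard symmetric cross-entropy over image->text and text->image.
    contrastive = 0.5 * (F.cross_entropy(logits, targets)
                         + F.cross_entropy(logits.t(), targets))

    # Placeholder regularizer: encourage matched pairs to stay aligned.
    reg_term = -torch.diagonal(image_emb @ text_emb.t()).mean()

    return contrastive + lam * reg_term
```

For the Dataset Splits row, the bookkeeping described there (82,783 images × 5 captions = 413,915 image-caption pairs, a random 20% validation split, and stopping once the validation loss stops decreasing) could be reproduced along the following lines; the `random_split` call and the stopping helper are assumptions, not the authors' code.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# 82,783 MSCOCO images x 5 captions each = 413,915 image-caption pairs.
num_pairs = 82_783 * 5
assert num_pairs == 413_915

# Stand-in dataset of pair indices; in practice each item would be an
# (image, caption) pair built from the MSCOCO annotations.
pair_dataset = TensorDataset(torch.arange(num_pairs))

# Hold out 20% of the pairs as a validation set.
val_size = int(0.2 * num_pairs)          # 82,783 pairs
train_size = num_pairs - val_size        # 331,132 pairs
train_set, val_set = random_split(
    pair_dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(0),
)

# Early stopping: halt once the validation contrastive loss stops decreasing.
def should_stop(val_losses):
    return len(val_losses) >= 2 and val_losses[-1] >= val_losses[-2]
```

For the Experiment Setup row, the per-module learning rates and temperature map naturally onto AdamW parameter groups. The sketch below assumes the usual timm and Huggingface loading calls and simple linear projection heads; it follows the structure of the Shariatnia (2021) framework rather than reproducing it, and placing the logit scale in the projection-layer group is an assumption.

```python
import torch
import torch.nn as nn
import timm
from transformers import DistilBertModel

embed_dim = 512

# Pre-trained encoders (num_classes=0 makes timm return pooled 2048-d features).
image_encoder = timm.create_model("resnet50", pretrained=True, num_classes=0)
text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")  # 768-d hidden size

# Linear projection heads onto the shared 512-d embedding space, as in CLIP.
image_proj = nn.Linear(2048, embed_dim)
text_proj = nn.Linear(768, embed_dim)

# Temperature tau = 0.07, stored as a learnable log-scale; ln(1/0.07) = 2.6593.
logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

# Per-module learning rates via AdamW parameter groups, weight decay 1e-3.
# (Grouping logit_scale with the projection layers is an assumption.)
optimizer = torch.optim.AdamW(
    [
        {"params": image_encoder.parameters(), "lr": 1e-4},
        {"params": text_encoder.parameters(), "lr": 1e-5},
        {"params": list(image_proj.parameters())
                   + list(text_proj.parameters())
                   + [logit_scale], "lr": 1e-3},
    ],
    weight_decay=1e-3,
)
```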