Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Authors: Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on real data to confirm our theoretical predictions. Furthermore, inspired by our theoretical findings, we propose a new regularization technique for CLIP that leads to improved zero-shot performance. Empirical results confirm that the proposed regularization effectively improves zero-shot performance across various tasks. |
| Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles; Machine Learning Department, Carnegie Mellon University, Pittsburgh. {chenzx19, yihedeng}@cs.ucla.edu; yuanzhil@andrew.cmu.edu, qgu@cs.ucla.edu |
| Pseudocode | No | The paper presents mathematical formulations of loss functions and theoretical theorems, but it does not include any clearly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | Yes | Our code is provided anonymously on GitHub: https://anonymous.4open.science/r/CLIP_theory-BC8F/README.md |
| Open Datasets | Yes | Datasets. For performance evaluation, we primarily focus on Conceptual Captions 3M (CC3M) (Sharma et al., 2018) as the pretraining dataset, in alignment with prior literature (Li et al., 2022; Goel et al., 2022). Additionally, we use MSCOCO (Chen et al., 2015) in order to conduct lightweight real data experiments to validate our theoretical findings. |
| Dataset Splits | Yes | Specifically, the dataset contains 82,783 images, each coupled with 5 captions. We consider each image-caption pair as a data example in pre-training and therefore arrive at 413,915 pre-training data pairs. We further randomly split the data to keep 20% as a validation set and stop training once the contrastive loss on the validation data no longer decreases, to avoid overfitting on the small dataset. (A minimal sketch of this split and stopping rule appears after the table.) |
| Hardware Specification | Yes | Lastly, our experiments can be feasibly run on a single GeForce RTX 2080 GPU. |
| Software Dependencies | No | The paper mentions using the "PyTorch Image Models library (Wightman, 2019)" and the "Huggingface Transformers library (Wolf et al., 2020)". While specific libraries are named with citations, explicit version numbers for these libraries or for the underlying Python/PyTorch environment are not provided. |
| Experiment Setup | Yes | We follow the code framework in Shariatnia (2021) and use a pre-trained ResNet-50 from the PyTorch Image Models library (Wightman, 2019) and a pre-trained DistilBERT from the Huggingface Transformers library (Wolf et al., 2020). We further have linear projection layers on both the image and text encoders, the same as in CLIP, and consider the embedding dimension to be 512. As we are training on small-scale data with pre-trained encoders, we follow Shariatnia (2021) and use the AdamW optimizer with learning rate 1e-4 on the image encoder, 1e-5 on the text encoder, and 1e-3 on the projection layers, with weight decay coefficient 1e-3. [...] For our regularization term, we use a coefficient of λ = 0.1. As in CLIP, we set the temperature τ to 0.07, equivalently having a maximum logit scale of 2.6593. Lastly, we use a training batch size of 32 and train for 8 epochs for the results reported in Section 7.2. (A sketch of this setup appears after the table.) |
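
The split and early-stopping rule quoted in the "Dataset Splits" row can be summarized in a few lines. This is a minimal sketch rather than the authors' code: the `pairs` list of (image, caption) tuples, the random seed, and the helper names `train_val_split` and `should_stop` are assumptions made for illustration.

```python
import random

def train_val_split(pairs, val_fraction=0.2, seed=0):
    """Randomly hold out a validation set from the image-caption pairs.

    `pairs` is assumed to be a list of (image_path, caption) tuples built from
    the MSCOCO annotations (82,783 images x 5 captions = 413,915 pairs).
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_fraction)  # 20% held out for validation
    return pairs[n_val:], pairs[:n_val]     # (train, val)

def should_stop(val_losses):
    """Stop training once the validation contrastive loss no longer decreases."""
    return len(val_losses) >= 2 and val_losses[-1] >= val_losses[-2]
```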
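
The "Experiment Setup" row maps onto standard timm and Hugging Face building blocks. The sketch below is hedged: the projection-head class, the feature dimensions (2048 for ResNet-50, 768 for DistilBERT), and the checkpoint name `distilbert-base-uncased` are assumptions that may differ from the released code, and the paper's regularization term (coefficient λ = 0.1) is omitted because its form is not given in the quoted text.

```python
import torch
from torch import nn
import timm
from transformers import AutoModel

EMBED_DIM = 512  # shared embedding dimension, as reported

class ProjectionHead(nn.Module):
    """Linear projection into the shared image-text embedding space."""
    def __init__(self, in_dim, out_dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x)

# Pre-trained encoders: ResNet-50 (timm) and DistilBERT (Hugging Face Transformers).
image_encoder = timm.create_model("resnet50", pretrained=True, num_classes=0)  # 2048-d pooled features
text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")            # 768-d hidden states

image_proj = ProjectionHead(in_dim=2048)
text_proj = ProjectionHead(in_dim=768)

# AdamW with per-module learning rates and weight decay 1e-3, as reported.
optimizer = torch.optim.AdamW(
    [
        {"params": image_encoder.parameters(), "lr": 1e-4},
        {"params": text_encoder.parameters(), "lr": 1e-5},
        {"params": list(image_proj.parameters()) + list(text_proj.parameters()), "lr": 1e-3},
    ],
    weight_decay=1e-3,
)

# Temperature tau = 0.07, i.e. a (maximum) log logit scale of ln(1/0.07) ≈ 2.6593.
logit_scale = nn.Parameter(torch.ones([]) * torch.log(torch.tensor(1.0 / 0.07)))
```

With a batch size of 32 and 8 epochs of training, this configuration matches the numbers quoted in the row above; the contrastive loss and the proposed regularizer would be computed from the projected, temperature-scaled image and text embeddings.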