Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Authors: Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the effectiveness and efficiency of our DeCLIP. As shown in Fig. 1, with a ResNet-50 image encoder and a Transformer text encoder, our model can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet-50 while using 7.1× fewer data. Our DeCLIP-ResNet-50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. |
| Researcher Affiliation | Collaboration | Yangguang Li¹, Feng Liang², Lichen Zhao¹, Yufeng Cui¹, Wanli Ouyang³, Jing Shao¹, Fengwei Yu¹, Junjie Yan¹ (¹SenseTime Research, ²The University of Texas at Austin, ³University of Sydney) |
| Pseudocode | Yes | Appendix A: Pseudo Code of DeCLIP, Algorithm 1 DeCLIP |
| Open Source Code | Yes | Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP |
| Open Datasets | Yes | Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP. Our DeCLIP full data consists of two parts: open-source data and web-crawled data. The open-source data comes from three different datasets: Conceptual Captions (CC3M) (Sharma et al., 2018), Conceptual 12M (CC12M) (Changpinyo et al., 2021), and YFCC (Thomee et al., 2016). |
| Dataset Splits | Yes | We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10⁻⁶ and 10⁶, with 96 logarithmically spaced steps. The hyperparameter sweeps are performed on a validation split of each dataset. |
| Hardware Specification | Yes | Our R50 and ViT-B32 took 8/10 days to train on 80 V100 GPUs, respectively. Our largest DeCLIP-RegNetY-64GF took 21 days on 160 V100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers such as the FP16-SGD optimizer and the AdamW optimizer but does not specify software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | The input resolution of the image encoder is 224×224, and the maximum context length of the text encoder is 76. The learnable temperature parameter τ is initialized to 0.07. The loss weights of the additional supervision α, β and γ are all set to 0.2. For R50, we use the FP16-SGD optimizer with a batch size of 10,240 (128×80). Starting with a 0.01 learning rate (lr), we first linearly increase the lr to 0.2 (a.k.a. warm-up) over one epoch, then use cosine-anneal lr decay to decrease the lr. The weight decay is set to 0.0001. |
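The L2-strength sweep quoted in the Dataset Splits row (96 logarithmically spaced values between 10⁻⁶ and 10⁶) can be sketched with NumPy's `logspace`, assuming the endpoints are inclusive (the paper gives only the range and step count):

```python
import numpy as np

# 96 candidate L2 regularization strengths, logarithmically spaced
# between 1e-6 and 1e6 (inclusive endpoints assumed).
lambdas = np.logspace(-6, 6, num=96)

# Each candidate would then be evaluated on the validation split of
# each downstream dataset, keeping the best-performing lambda.
```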
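The learning-rate schedule in the Experiment Setup row (linear warm-up from 0.01 to 0.2 over one epoch, then cosine anneal) can be sketched as a small helper. `lr_at` and the step counts are hypothetical; the paper states the start/peak rates and schedule shape but not the exact implementation:

```python
import math

def lr_at(step, warmup_steps, total_steps,
          base_lr=0.01, peak_lr=0.2, final_lr=0.0):
    """Linear warm-up from base_lr to peak_lr, then cosine anneal to final_lr.

    Hypothetical sketch: the paper specifies a 0.01 starting lr, a 0.2 peak
    after one warm-up epoch, and cosine decay, but not this exact code.
    """
    if step < warmup_steps:
        # Linear ramp during warm-up.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    # Cosine anneal over the remaining steps.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * 0.5 * (1 + math.cos(math.pi * t))
```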