Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Authors: Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the effectiveness and efficiency of our DeCLIP. As shown in Fig. 1, with a ResNet-50 image encoder and a Transformer text encoder, our model can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet-50 while using 7.1× fewer data. Our DeCLIP-ResNet-50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. |
| Researcher Affiliation | Collaboration | Yangguang Li¹, Feng Liang², Lichen Zhao¹, Yufeng Cui¹, Wanli Ouyang³, Jing Shao¹, Fengwei Yu¹, Junjie Yan¹ (¹SenseTime Research, ²The University of Texas at Austin, ³University of Sydney) |
| Pseudocode | Yes | Appendix A: Pseudo Code of DeCLIP, Algorithm 1 DeCLIP |
| Open Source Code | Yes | Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP |
| Open Datasets | Yes | Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP. Our DeCLIP full data consists of two parts: open-source data and web-crawled data. The open-source data comes from three different datasets: Conceptual Captions (CC3M) (Sharma et al., 2018), Conceptual 12M (CC12M) (Changpinyo et al., 2021), and YFCC (Thomee et al., 2016). |
| Dataset Splits | Yes | We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10⁻⁶ and 10⁶, with 96 logarithmically spaced steps. The hyperparameter sweeps are performed on a validation split of each dataset. |
| Hardware Specification | Yes | Our R50 and ViT-B32 took 8/10 days to train on 80 V100 GPUs, respectively. Our largest DeCLIP-RegNetY-64GF took 21 days on 160 V100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers such as the FP16-SGD optimizer and the AdamW optimizer but does not specify software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | The input resolution of the image encoder is 224×224, and the maximum context length of the text encoder is 76. The learnable temperature parameter τ is initialized to 0.07. The loss weights of the additional supervision α, β and γ are all set to 0.2. For R50, we use the FP16-SGD optimizer with a batch size of 10,240 (128×80). Starting with a 0.01 learning rate (lr), we first linearly increase the lr to 0.2 (a.k.a. warm-up) over one epoch, then use cosine-anneal lr decay to decrease the lr. The weight decay is set to 0.0001. |
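The L2-strength sweep quoted in the Dataset Splits row (96 logarithmically spaced values between 10⁻⁶ and 10⁶) can be sketched with NumPy's `logspace`, assuming the endpoints are inclusive (the paper gives only the range and step count):

```python
import numpy as np

# 96 candidate L2 regularization strengths, logarithmically spaced
# between 1e-6 and 1e6 (inclusive endpoints assumed).
lambdas = np.logspace(-6, 6, num=96)

# Each candidate would then be evaluated on the validation split of
# each downstream dataset, keeping the best-performing lambda.
```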
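The learning-rate schedule in the Experiment Setup row (linear warm-up from 0.01 to 0.2 over one epoch, then cosine anneal) can be sketched as a small helper. `lr_at` and the step counts are hypothetical; the paper states the start/peak rates and schedule shape but not the exact implementation:

```python
import math

def lr_at(step, warmup_steps, total_steps,
          base_lr=0.01, peak_lr=0.2, final_lr=0.0):
    """Linear warm-up from base_lr to peak_lr, then cosine anneal to final_lr.

    Hypothetical sketch: the paper specifies a 0.01 starting lr, a 0.2 peak
    after one warm-up epoch, and cosine decay, but not this exact code.
    """
    if step < warmup_steps:
        # Linear ramp during warm-up.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    # Cosine anneal over the remaining steps.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * 0.5 * (1 + math.cos(math.pi * t))
```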