CyCLIP: Cyclic Contrastive Language-Image Pretraining

Authors: Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, Aditya Grover

NeurIPS 2022

Reproducibility assessment. Each entry gives the variable, the result, and the LLM's supporting response:

Research Type: Experimental. "Empirically, we show that the improved consistency in CyCLIP translates to significant gains over CLIP, with gains ranging from 10%-24% for zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27% for robustness to various natural distribution shifts."

Researcher Affiliation: Collaboration. Shashank Goel (UCLA, shashankgoel@ucla.edu); Hritik Bansal (UCLA, hbansal@ucla.edu); Sumit Bhatia (MDSR Lab, Adobe Systems, sumit.bhatia@adobe.com); Ryan A. Rossi (Adobe Research, ryrossi@adobe.com); Vishwa Vinay (Adobe Research, vinay@adobe.com); Aditya Grover (UCLA, adityag@cs.ucla.edu).

Pseudocode: No. The paper does not contain a clearly labeled "Pseudocode" or "Algorithm" block or figure.

Open Source Code: Yes. "The code is available at https://github.com/goel-shashank/CyCLIP."

Open Datasets: Yes. "We use Conceptual Captions 3M [52] (CC3M) image-caption pairs as the source of multimodal pretraining data for all our models." "We compare the zero-shot performance of CLIP and CyCLIP on standard image classification datasets: CIFAR-10, CIFAR-100 [31], and ImageNet1K [49]." (A sketch of the zero-shot protocol appears after this table.)

Dataset Splits: Yes. "The consistency score is calculated over 10K, 10K, and 50K testing images of the CIFAR-10, CIFAR-100 and ImageNet dataset respectively. We use 50K samples from the training set of each dataset for k-Nearest Neighbor prediction. We assess our models on the test set of Flickr30K (1K) and MSCOCO (5K) obtained from the well-known Karpathy [30] split."

Hardware Specification: Yes. "Further, we train our models from scratch for 64 epochs on 4 V100 GPUs with a batch size of 128 and an initial learning rate of 0.0005 with cosine scheduling and 10000 warmup steps." (A sketch of this schedule appears after this table.)

Software Dependencies: No. The paper does not provide specific version numbers for software dependencies or libraries used for the experiments.

Experiment Setup: Yes. "Further, we train our models from scratch for 64 epochs on 4 V100 GPUs with a batch size of 128 and an initial learning rate of 0.0005 with cosine scheduling and 10000 warmup steps. The dimension of the image and text embeddings is 1024. For CyCLIP, we use λ1 = 0.25 and λ2 = 0.25 across all our experiments." (A sketch of the resulting objective appears after this table.)
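
The Experiment Setup entry reports λ1 = λ2 = 0.25 for CyCLIP's two cyclic-consistency regularizers. Below is a minimal PyTorch sketch of the combined objective, assuming L2-normalized 1024-dimensional embeddings as described above. Which λ multiplies which term follows the paper's convention; with both set to 0.25 the ordering is immaterial here.

```python
import torch
import torch.nn.functional as F

def cyclip_loss(image_emb, text_emb, logit_scale, lambda1=0.25, lambda2=0.25):
    """CLIP contrastive loss plus CyCLIP's two cyclic-consistency regularizers."""
    # Work with unit-norm embeddings so inner products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)   # (N, 1024)
    text_emb = F.normalize(text_emb, dim=-1)     # (N, 1024)

    # Standard CLIP loss: symmetric cross-entropy over scaled similarities.
    logits = logit_scale * image_emb @ text_emb.t()              # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, targets)
                       + F.cross_entropy(logits.t(), targets))

    # Cross-modal consistency: sim(I_j, T_k) should equal sim(I_k, T_j),
    # i.e. the image-text similarity matrix should be symmetric.
    sim_it = image_emb @ text_emb.t()
    cross_modal = ((sim_it - sim_it.t()) ** 2).mean()

    # In-modal consistency: image-image similarities should match the
    # corresponding text-text similarities.
    sim_ii = image_emb @ image_emb.t()
    sim_tt = text_emb @ text_emb.t()
    in_modal = ((sim_ii - sim_tt) ** 2).mean()

    return clip_loss + lambda1 * in_modal + lambda2 * cross_modal
```

The cross-modal term pushes the image-text similarity matrix toward symmetry, while the in-modal term aligns the internal similarity geometry of the two modalities; this is the "improved consistency" the Research Type entry refers to.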
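The reported schedule (initial learning rate 0.0005, cosine scheduling, 10000 warmup steps) can be reproduced with a standard warmup-then-cosine scheduler. A hedged sketch follows; AdamW and the linear warmup shape are assumptions, since the report does not name the optimizer.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, warmup_steps=10_000, base_lr=5e-4):
    """Warmup-then-cosine schedule matching the reported hyperparameters."""
    # Optimizer choice is an assumption; the report only gives the LR schedule.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                    # linear warmup to base_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Here `scheduler.step()` would be called once per batch; with CC3M's roughly 3M pairs at batch size 128, 64 epochs corresponds to on the order of 1.5M optimizer steps.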
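The zero-shot results cited in the Research Type and Open Datasets entries follow the standard CLIP evaluation protocol: each class name is embedded via a text prompt, and an image is assigned to the class whose text embedding is most similar. A minimal sketch, assuming hypothetical `encode_image`/`encode_text` methods and a single prompt template (the paper may ensemble several templates):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names, device="cuda"):
    """Predict class indices by nearest text embedding in the shared space."""
    # One prompt per class; `tokenizer` is assumed to return a token tensor
    # (hypothetical API standing in for the repository's actual interface).
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts).to(device)), dim=-1)  # (C, D)

    # Embed the images and pick the most similar class embedding.
    image_emb = F.normalize(model.encode_image(images.to(device)), dim=-1)            # (B, D)
    return (image_emb @ text_emb.t()).argmax(dim=-1)
```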