CyCLIP: Cyclic Contrastive Language-Image Pretraining
Authors: Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, Aditya Grover
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that the improved consistency in CyCLIP translates to significant gains over CLIP, with gains ranging from 10%-24% for zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27% for robustness to various natural distribution shifts. |
| Researcher Affiliation | Collaboration | Shashank Goel (UCLA, shashankgoel@ucla.edu); Hritik Bansal (UCLA, hbansal@ucla.edu); Sumit Bhatia (MDSR Lab, Adobe Systems, sumit.bhatia@adobe.com); Ryan A. Rossi (Adobe Research, ryrossi@adobe.com); Vishwa Vinay (Adobe Research, vinay@adobe.com); Aditya Grover (UCLA, adityag@cs.ucla.edu) |
| Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block or figure. |
| Open Source Code | Yes | The code is available at https://github.com/goel-shashank/CyCLIP. |
| Open Datasets | Yes | We use Conceptual Captions 3M [52] (CC3M) image-caption pairs as the source of multimodal pretraining data for all our models. We compare the zero-shot performance of CLIP and CyCLIP on standard image classification datasets: CIFAR-10, CIFAR-100 [31], and ImageNet1K [49]. |
| Dataset Splits | Yes | The consistency score is calculated over 10K, 10K, and 50K testing images of the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively. We use 50K samples from the training set of each dataset for k-Nearest Neighbor prediction. We assess our models on the test set of Flickr30K (1K) and MSCOCO (5K) obtained from the well-known Karpathy [30] split. |
| Hardware Specification | Yes | Further, we train our models from scratch for 64 epochs on 4 V100 GPUs with a batch size of 128 and an initial learning rate of 0.0005 with cosine scheduling and 10000 warmup steps. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used for the experiments. |
| Experiment Setup | Yes | Further, we train our models from scratch for 64 epochs on 4 V100 GPUs with a batch size of 128 and an initial learning rate of 0.0005 with cosine scheduling and 10000 warmup steps. The dimension of the image and text embeddings is 1024. For CyCLIP, we use λ1 = 0.25 and λ2 = 0.25 across all our experiments. |
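
Since the paper contains no labeled pseudocode block (see the Pseudocode row) but does specify its objective in prose, the following is a minimal PyTorch sketch of the CyCLIP loss as we read it: the symmetric CLIP contrastive loss plus an in-modal and a cross-modal cyclic-consistency regularizer, weighted by the λ1 = λ2 = 0.25 reported in the Experiment Setup row. The function name `cyclip_loss`, the `logit_scale` handling, the pairing of each λ with each regularizer, and the mean reduction are our assumptions; consult the released code at https://github.com/goel-shashank/CyCLIP for the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def cyclip_loss(image_emb, text_emb, logit_scale, lam1=0.25, lam2=0.25):
    """Sketch of CLIP contrastive loss + CyCLIP cyclic-consistency terms.

    image_emb, text_emb: (N, d) embeddings, assumed L2-normalized.
    lam1 weights the in-modal term and lam2 the cross-modal term here;
    the paper sets both to 0.25, so the pairing is immaterial numerically.
    """
    # Pairwise image-text similarity matrices.
    logits_per_image = logit_scale * image_emb @ text_emb.t()  # (N, N)
    logits_per_text = logits_per_image.t()

    # Symmetric InfoNCE loss, as in CLIP.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    clip_loss = 0.5 * (F.cross_entropy(logits_per_image, targets)
                       + F.cross_entropy(logits_per_text, targets))

    # In-modal consistency: image-image vs. text-text similarities.
    sim_ii = image_emb @ image_emb.t()
    sim_tt = text_emb @ text_emb.t()
    in_modal = ((sim_ii - sim_tt) ** 2).mean()

    # Cross-modal consistency: <I_j, T_k> vs. <I_k, T_j>, i.e. symmetry
    # of the cross-modal similarity matrix.
    sim_it = image_emb @ text_emb.t()
    cross_modal = ((sim_it - sim_it.t()) ** 2).mean()

    return clip_loss + lam1 * in_modal + lam2 * cross_modal
```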
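
The Experiment Setup row also fixes the optimization schedule: peak learning rate 0.0005, cosine decay, and 10000 warmup steps. Below is a short sketch of one way to realize that schedule in PyTorch; the optimizer choice (AdamW), the linear warmup shape, and the stand-in model are assumptions not stated in the table.

```python
import math
import torch

def make_scheduler(optimizer, total_steps, warmup_steps=10_000):
    """Linear warmup followed by cosine decay (assumed shape)."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Hyperparameters quoted in the table: batch size 128, 64 epochs, peak LR 5e-4.
model = torch.nn.Linear(1024, 1024)  # stand-in for the actual encoders
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # AdamW is assumed
steps_per_epoch = 3_000_000 // 128   # ~3M CC3M pairs / batch size
scheduler = make_scheduler(optimizer, total_steps=64 * steps_per_epoch)
```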