TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Authors: Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou 'YZ' Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Section 4, titled 'Experiments & Results', details comprehensive empirical evaluations. |
| Researcher Affiliation | Academia | Arizona State University; University of Maryland, Baltimore County |
| Pseudocode | Yes | We provide the pseudo-code in the appendix and the code in supplementary materials. (Appendix B: Pseudocode of TripletCLIP; an illustrative loss sketch appears after the table.) |
| Open Source Code | Yes | Our code, models, and data are available at: tripletclip.github.io. |
| Open Datasets | Yes | We utilize the CC3M and CC12M datasets, which comprise 2.6M and 8.6M image-text pairs, respectively. Following the approach demonstrated by LaCLIP, we use LLM-rewritten captions to replace noisy original captions. |
| Dataset Splits | No | The paper mentions using the CC3M and CC12M datasets for pretraining and various downstream tasks for evaluation, but it does not explicitly provide training, validation, and test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | All models are trained on a single A100 (80GB) GPU using bf16 precision. |
| Software Dependencies | No | The paper mentions `bf16 precision` and the `AdamW optimizer` but does not provide specific version numbers for software dependencies or libraries used (e.g., PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Our experiments employ the ViT-B/32 [10] model architecture. To guarantee fair comparisons, we retrain all baseline models using identical hyperparameters. Since the overall training data for NegCLIP and TripletData is more than the baseline datasets, we align the number of iterations across all models to equalize the number of image-text pairs seen during training, similar to the strategy used in DataComp. The batch size is fixed to 1024 with the AdamW optimizer at a maximum learning rate of 0.0005, employing cosine decay. Training durations are set at approximately 100k iterations for CC3M and 200k iterations for CC12M. (Also refers to Table 10 in Appendix C for detailed hyperparameters; a configuration sketch appears after the table.) |
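
The hyperparameters quoted in the Experiment Setup row map onto a fairly standard contrastive pretraining loop. The sketch below is a minimal illustration of that configuration in PyTorch, assuming the open_clip library for the ViT-B/32 model, default AdamW weight decay, no warmup, and a `loader` that yields preprocessed image tensors with tokenized captions in batches of 1024; none of these assumptions come from the paper, and Appendix C (Table 10) remains the authoritative source.

```python
# Minimal sketch of the reported configuration: ViT-B/32, batch size 1024, AdamW with
# max LR 5e-4 and cosine decay, bf16 precision, ~100k iterations (CC3M) / ~200k (CC12M),
# single A100 (80GB). Assumptions not stated in the report: open_clip as the model
# library, default weight decay, no warmup, and the exact shape of `loader`.
import torch
import torch.nn.functional as F
import open_clip


def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Standard symmetric InfoNCE over the in-batch image-text similarity matrix.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def train(loader, dataset="cc3m", device="cuda"):
    # Randomly initialized ViT-B/32 CLIP (training from scratch, no pretrained weights).
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32")
    model = model.to(device)

    total_iters = 100_000 if dataset == "cc3m" else 200_000
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # max learning rate 0.0005
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters)

    for step, (images, texts) in enumerate(loader):  # batches of 1024 image-text pairs
        if step >= total_iters:
            break
        images, texts = images.to(device), texts.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16 precision
            image_features, text_features, logit_scale = model(images, texts)
            loss = clip_contrastive_loss(image_features, text_features, logit_scale)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model
```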
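
The Pseudocode row defers the exact TripletCLIP objective to the paper's Appendix B, which this report does not reproduce. Purely as an illustration of the idea named in the title (synthetic vision-language negatives), the sketch below assumes that each original (image, caption) pair comes with a synthetic (negative image, negative caption) pair and that the standard CLIP contrastive loss is applied over the concatenated batch, so each synthetic pair acts as an in-batch hard negative; both the `tripletclip_loss` name and this formulation are assumptions, not the paper's definition.

```python
# Illustrative sketch only: one plausible way to fold synthetic hard negatives into the
# CLIP contrastive objective. The exact TripletCLIP loss is defined in the paper's
# Appendix B; the formulation and function names below are assumptions.
import torch
import torch.nn.functional as F


def symmetric_infonce(image_features, text_features, logit_scale):
    # Standard CLIP loss: cross-entropy over image-to-text and text-to-image logits.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def tripletclip_loss(img, txt, neg_img, neg_txt, logit_scale):
    """Contrastive loss over original pairs augmented with synthetic negative pairs.

    img / txt: L2-normalized features of the original (image, caption) pairs.
    neg_img / neg_txt: features of the synthetic (negative image, negative caption)
    pairs generated for the same anchors. Concatenating both sets makes each synthetic
    pair a hard in-batch negative for its corresponding original pair, and vice versa.
    """
    image_features = torch.cat([img, neg_img], dim=0)
    text_features = torch.cat([txt, neg_txt], dim=0)
    return symmetric_infonce(image_features, text_features, logit_scale)
```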