TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Authors: Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou 'YZ' Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Section 4, titled 'Experiments & Results', details comprehensive empirical evaluations. |
| Researcher Affiliation | Academia | Arizona State University; University of Maryland, Baltimore County |
| Pseudocode | Yes | We provide the pseudo-code in the appendix and the code in supplementary materials. (Appendix B: Pseudocode of TripletCLIP; an illustrative loss sketch appears after the table.) |
| Open Source Code | Yes | Our code, models, and data are available at: tripletclip.github.io. |
| Open Datasets | Yes | We utilize the CC3M and CC12M datasets, which comprise 2.6M and 8.6M image-text pairs, respectively. Following the approach demonstrated by LaCLIP, we use LLM-rewritten captions to replace noisy original captions. |
| Dataset Splits | No | The paper mentions using the CC3M and CC12M datasets for pretraining and various downstream tasks for evaluation, but it does not explicitly provide training, validation, and test splits (e.g., percentages or sample counts) for these datasets. |
| Hardware Specification | Yes | All models are trained on a single A100 (80GB) GPU using bf16 precision. |
| Software Dependencies | No | The paper mentions `bf16 precision` and the `AdamW optimizer` but does not provide specific version numbers for software dependencies or libraries used (e.g., PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Our experiments employ the ViT-B/32 [10] model architecture. To guarantee fair comparisons, we retrain all baseline models using identical hyperparameters. Since the overall training data for NegCLIP and TripletData is more than the baseline datasets, we align the number of iterations across all models to equalize the number of image-text pairs seen during training, similar to the strategy used in DataComp. The batch size is fixed to 1024 with the AdamW optimizer at a maximum learning rate of 0.0005, employing cosine decay. Training durations are set at approximately 100k iterations for CC3M and 200k iterations for CC12M. (Also refers to Table 10 in Appendix C for detailed hyperparameters; a configuration sketch appears after the table.) |
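
The hyperparameters quoted in the Experiment Setup row map onto a fairly standard contrastive pretraining loop. The sketch below is a minimal illustration of that configuration in PyTorch, assuming the open_clip library for the ViT-B/32 model, default AdamW weight decay, no warmup, and a `loader` that yields preprocessed image tensors with tokenized captions in batches of 1024; none of these assumptions come from the paper, and Appendix C (Table 10) remains the authoritative source.

```python
# Minimal sketch of the reported configuration: ViT-B/32, batch size 1024, AdamW with
# max LR 5e-4 and cosine decay, bf16 precision, ~100k iterations (CC3M) / ~200k (CC12M),
# single A100 (80GB). Assumptions not stated in the report: open_clip as the model
# library, default weight decay, no warmup, and the exact shape of `loader`.
import torch
import torch.nn.functional as F
import open_clip


def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Standard symmetric InfoNCE over the in-batch image-text similarity matrix.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def train(loader, dataset="cc3m", device="cuda"):
    # Randomly initialized ViT-B/32 CLIP (training from scratch, no pretrained weights).
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32")
    model = model.to(device)

    total_iters = 100_000 if dataset == "cc3m" else 200_000
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # max learning rate 0.0005
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters)

    for step, (images, texts) in enumerate(loader):  # batches of 1024 image-text pairs
        if step >= total_iters:
            break
        images, texts = images.to(device), texts.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16 precision
            image_features, text_features, logit_scale = model(images, texts)
            loss = clip_contrastive_loss(image_features, text_features, logit_scale)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model
```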
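
The Pseudocode row defers the exact TripletCLIP objective to the paper's Appendix B, which this report does not reproduce. Purely as an illustration of the idea named in the title (synthetic vision-language negatives), the sketch below assumes that each original (image, caption) pair comes with a synthetic (negative image, negative caption) pair and that the standard CLIP contrastive loss is applied over the concatenated batch, so each synthetic pair acts as an in-batch hard negative; both the `tripletclip_loss` name and this formulation are assumptions, not the paper's definition.

```python
# Illustrative sketch only: one plausible way to fold synthetic hard negatives into the
# CLIP contrastive objective. The exact TripletCLIP loss is defined in the paper's
# Appendix B; the formulation and function names below are assumptions.
import torch
import torch.nn.functional as F


def symmetric_infonce(image_features, text_features, logit_scale):
    # Standard CLIP loss: cross-entropy over image-to-text and text-to-image logits.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def tripletclip_loss(img, txt, neg_img, neg_txt, logit_scale):
    """Contrastive loss over original pairs augmented with synthetic negative pairs.

    img / txt: L2-normalized features of the original (image, caption) pairs.
    neg_img / neg_txt: features of the synthetic (negative image, negative caption)
    pairs generated for the same anchors. Concatenating both sets makes each synthetic
    pair a hard in-batch negative for its corresponding original pair, and vice versa.
    """
    image_features = torch.cat([img, neg_img], dim=0)
    text_features = torch.cat([txt, neg_txt], dim=0)
    return symmetric_infonce(image_features, text_features, logit_scale)
```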