Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Authors: Amit Peleg, Naman Deep Singh, Matthias Hein

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We provide empirical results in Section 4 and in Appendix E to back up the contributions of the paper. ... Evaluation. We evaluated the models on the compositionality benchmarks SugarCrepe++ [11], WinoGround [52], and SugarCrepe [18]. Additionally, we assess the effects of fine-tuning on downstream tasks such as classification on IMAGENET [6] and on ZS-10, which computes the average score across ten standard classification datasets, detailed in Appendix D.2. In addition, we report image (T→I) and text retrieval (I→T) Recall@5 scores on MS-COCO (Val2017) [29] and Flickr30k [43].
Researcher Affiliation Academia Amit Peleg, Naman Deep Singh, Matthias Hein; Tübingen AI Center, University of Tübingen. Equal contribution. Correspondence: EMAIL
Pseudocode Yes Algorithm 1 Training Procedure with Concatenated Images and Hard Negatives 1: function GENERATEPOSNEG(image pair (xi, xi+m), caption pair (yi, yi+m)) ... The pseudo-code can be found in Algorithm 1.
Open Source Code Yes All our models and code are available at https://clic-compositional-clip.github.io. ... Code will be released upon acceptance of the paper under MIT license.
Open Datasets Yes For CLIC, we fine-tune models either with CogVLM [53] recaptioned Laion images, or PixelProse [50] recaptioned images from RedCaps and CC12M (see Appendix B for details). ... Evaluation. We evaluated the models on the compositionality benchmarks SugarCrepe++ [11], WinoGround [52], and SugarCrepe [18]. Additionally, we assess the effects of fine-tuning on downstream tasks such as classification on IMAGENET [6] and on ZS-10, which computes the average score across ten standard classification datasets, detailed in Appendix D.2. In addition, we report image (T→I) and text retrieval (I→T) Recall@5 scores on MS-COCO (Val2017) [29] and Flickr30k [43].
Dataset Splits Yes We train for one epoch on our 1M Laion subset and on our 850k PixelProse dataset, while for MS-COCO, we trained for five epochs... The 10 zero-shot classification datasets we use are a subset from the CLIP_benchmark. Specifically, we use 1k images of each of the following datasets: Country-211 [44], Caltech-101 [15], Oxford Pets [40], DTD [4], FGCV Aircrafts [37], Stanford Cars [22], Cifar-10,100 [23], Food-101 [2].
Hardware Specification Yes All the work was carried out on A100 40G GPUs. The training runs are across 4 GPUs.
Software Dependencies No We are using spaCy [17] to extract the POS tags of captions as described in Appendix B.2. We use a VLM for creating captions, but the same results can be achieved with standard datasets, as corroborated by our experiments. ... We used the standard AdamW [34] optimizer with beta parameters (0.9, 0.98) and ϵ set to 1e−8 with a weight decay of 0.1. ... Next, we leverage the spaCy package [17] to generate hard-negatives.
Experiment Setup Yes Training details. In all of our experiments for CLIC, we keep the vision encoder frozen and fine-tune only the text encoder at an image resolution of 224 × 224. ... In all of the experiments, the loss parameters in Eq. (7) are set to λCont = 1/2, λS-Neg = 1/2 and λUni = 1. We train for one epoch on our 1M Laion subset and on our 850k PixelProse dataset, while for MS-COCO, we trained for five epochs... All experiments are conducted at an image resolution of 224 × 224. Our effective batch size is 200 × 4, and we use a cosine scheduler, where the warm-up phase is 20% of the training time. The learning rate (LR) starts at 1e−7, peaks at 1e−6, and arrives at 1e−8 at the end. We used the standard AdamW [34] optimizer with beta parameters (0.9, 0.98) and ϵ set to 1e−8 with a weight decay of 0.1.
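
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reading of the quoted numbers, not the authors' code. The quoted text does not give the shape of the warm-up, so a linear ramp is assumed here. The step count assumes exactly 1,000,000 samples for the "1M" Laion subset, and the name `learning_rate` and the constant names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
LR_START, LR_PEAK, LR_END = 1e-7, 1e-6, 1e-8
WARMUP_FRACTION = 0.2            # warm-up is 20% of the training time
EFFECTIVE_BATCH_SIZE = 200 * 4   # "200 x 4" as quoted; runs use 4 GPUs

# Quoted AdamW settings, shown as keyword arguments an optimizer might take.
ADAMW_KWARGS = {"betas": (0.9, 0.98), "eps": 1e-8, "weight_decay": 0.1}


def learning_rate(step: int, total_steps: int) -> float:
    """Return the LR at `step`: warm-up to the peak, then cosine decay."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # ASSUMPTION: linear warm-up from LR_START to LR_PEAK.
        return LR_START + (LR_PEAK - LR_START) * step / warmup_steps
    # Cosine decay from LR_PEAK down to LR_END over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return LR_END + 0.5 * (LR_PEAK - LR_END) * (1.0 + math.cos(math.pi * progress))


# ASSUMPTION: one epoch over exactly 1,000,000 samples.
total_steps = 1_000_000 // EFFECTIVE_BATCH_SIZE
```

Under these assumptions one epoch is 1,250 optimizer steps, of which the first 250 are warm-up. The same function could be wrapped in a per-step scheduler callback in whichever training framework is used.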