Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Object-centric binding in Contrastive Language-Image Pretraining
Authors: Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate OC-CLIP's inductive biases in 3 different settings: Addressing CLIP's binding problem. We show the efficiency of OC-CLIP in addressing the binding problem compared to fine-grained hard-negative based augmentation on a synthetic dataset (Section 4.1). Compositional understanding. We showcase OC-CLIP's compositional understanding, in domain, on real-world object-centric attribute binding and spatial relationship understanding benchmarks (Section 4.2). Scaling on noisy data. We show that OC-CLIP consistently outperforms a CLIP-based model in both zero-shot single object classification and zero-shot compositional understanding multi-object text retrieval, when training both models fully from scratch on larger-scale and noisy dataset (Section 4.3). |
| Researcher Affiliation | Collaboration | Rim Assouel1,2,3 Florian Bordes1 Pietro Astolfi1 Michal Drozdzal1 Adriana Romero-Soriano1,2,4,5 1FAIR at Meta 2Mila Québec AI Institute 3Université de Montréal 4McGill University 5Canada CIFAR AI Chair |
| Pseudocode | Yes | A.10 Binding Module Code See Figure 16 |
| Open Source Code | No | Limitations and Future Work. Our OC-CLIP model has several limitations and avenues for future work. Notably, our approach relies on a parser to extract object-centric attributes and spatial relationships from text descriptions. While we have chosen an LLM-based parser, which is discussed in Appendix A.4, studying the different biases of LLM-based parser families could be interesting. Additionally, while we show the scaling potential of OC-CLIP at 15M scale (A.1, A.3), the model needs further scaling to be fully comparable to all the CLIP variants, trained at least at 400M scale. |
| Open Datasets | Yes | The training text descriptions representing positive samples are taken from COCO [Lin et al., 2014], Visual-Genome (VG) [Krishna et al., 2017] and GQA [Hudson and Manning, 2019]. ... In particular, we use SugarCrepe [Hsieh et al., 2023b] and ARO-A [Yuksekgonul et al., 2023a] for attribute binding and ARO-Relation (ARO-R) [Yuksekgonul et al., 2023a], COCO-spatial and GQA-spatial [Kamath et al., 2023] for spatial relationship understanding. |
| Dataset Splits | No | For the compositional experiments we train both OpenCLIP and OC-CLIP on aggregated data from COCO-Captions (COCO) [Lin et al., 2014], Visual Genome (VG) [Krishna et al., 2017] and GQA [Hudson and Manning, 2019]. All these datasets cover the same 110k images from COCO but focus on different kinds of annotations. |
| Hardware Specification | Yes | Both models were trained using 4x8 V100 GPUs with a local batch size of 128. ... For the cc3m and cc12m, in order to accelerate the parsing, we kept the LLM parser local using ollama on V100 GPUs. |
| Software Dependencies | No | The paper mentions `AdamW optimizer`, `llama3-8b`, `ollama` (implied software), `spacy [Honnibal and Montani, 2017]`, and `T5 model [Li et al., 2023b]` as tools or models used, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA that would be needed for replication. |
| Experiment Setup | Yes | We use a batch size of 128 and a learning rate of $2 \times 10^{-4}$ to train OC-CLIP for 100 epochs. We use a batch size of 256 following previous finetuning approaches [Kamath et al., 2023, Yuksekgonul et al., 2023b] and a learning rate of $4 \times 10^{-6}$ for 20 epochs to finetune the OpenCLIP baseline. ... Both CLIP and OC-CLIP architectures are trained fully from scratch for 5, 15, or 25 epochs, using a batch size of 4096, a learning rate of $1 \times 10^{-3}$, 2k steps of learning rate warm-up, and a cosine decay. As recommended by Mu et al. [2021], we use AdamW optimizer with 0.5 of weight decay and β2 set to 0.98. |
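
For readers who want to see the from-scratch schedule quoted in the Experiment Setup row in concrete form, the following is a minimal sketch rather than the authors' implementation. The peak learning rate (1e-3), the 2k warm-up steps, and the cosine decay come from the quoted text. The linear shape of the warm-up, the decay floor of zero (`min_lr`), the function name `lr_at_step`, and the `total_steps` value in the usage lines are all assumptions made for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=2000, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Assumed shape: linear warm-up to peak_lr over warmup_steps,
    then cosine decay from peak_lr down to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Assumption: the warm-up is linear; the source only gives its length.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run length; the source reports epochs, not total steps.
    total = 10_000
    for s in (0, 999, 1999, 6000, 10_000):
        print(s, lr_at_step(s, total))
```

The total number of steps depends on dataset size, batch size, and epoch count, so it is left as a parameter here rather than derived from the quoted figures.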