Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Authors: Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that Llip outperforms noncontextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% zero-shot classification benchmarks with a ViT-G/14 encoder. |
| Researcher Affiliation | Collaboration | 1 FAIR at Meta; 2 Mila, Université de Montréal; 3 New York University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code for the methodology described. |
| Open Datasets | Yes | Our models were trained on the Common Crawl data curated using the methodology presented in Xu et al. (2023). We use a dataset of 2.5B image-text pairs collected using the same parameters that was used in Xu et al. (2023). |
| Dataset Splits | Yes | We collect the embedding vectors of 5000 samples from ImageNet's validation set randomly chosen. |
| Hardware Specification | Yes | The ViT-B and ViT-L models were trained on 128 V100 and A100 respectively. The larger models were trained on 256 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and the AdamW optimizer, but does not specify version numbers for these software dependencies, only citing papers for them. |
| Experiment Setup | Yes | We pre-train our models with the AdamW optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2017) with β2 = 0.95 as done by Zhai et al. (2023) to stabilize the pre-training. We use a learnable scale parameter a along with a learnable bias b for our objective following the initialization of Zhai et al. (2023). Otherwise, all other training decisions closely follow the ones used by Radford et al. (2021); Xu et al. (2023). For all of the Llip experiments, we fix M = 8 the number of heads in the cross-attention. Unless mentioned otherwise, the cross-attention's temperature τ = 5. |
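
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when checking what a reimplementation would still need to look up. The following is only a sketch: the class name `LlipSetupSketch` and all field names are hypothetical and do not come from the paper or from any released code. Only values stated in the quoted text are filled in; everything else is left as `None` rather than guessed.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class LlipSetupSketch:
    """Hypothetical container for the settings quoted in the table above."""

    # Values stated in the quoted Experiment Setup text.
    optimizer: str = "AdamW"
    adam_beta2: float = 0.95                   # "β2 = 0.95"
    learnable_scale: bool = True               # learnable scale parameter a
    learnable_bias: bool = True                # learnable bias b
    num_cross_attention_heads: int = 8         # "M = 8"
    cross_attention_temperature: float = 5.0   # "τ = 5" unless mentioned otherwise

    # Not given in the quoted text, so deliberately left unset (assumption:
    # a reimplementation would take these from the cited papers).
    learning_rate: Optional[float] = None
    adam_beta1: Optional[float] = None
    weight_decay: Optional[float] = None

    def unspecified(self) -> List[str]:
        """Return the names of settings the quoted text does not provide."""
        return sorted(name for name, value in asdict(self).items() if value is None)


config = LlipSetupSketch()
print(config.unspecified())
```

Listing the unset fields makes explicit which settings the summary defers to Radford et al. (2021), Xu et al. (2023), and Zhai et al. (2023).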