Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Authors: Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. |
| Researcher Affiliation | Collaboration | FAIR at Meta; Mila, Université de Montréal; New York University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code for the methodology described. |
| Open Datasets | Yes | Our models were trained on the Common Crawl data curated using the methodology presented in Xu et al. (2023). We use a dataset of 2.5B image-text pairs collected using the same parameters that were used in Xu et al. (2023). |
| Dataset Splits | Yes | We collect the embedding vectors of 5000 randomly chosen samples from ImageNet's validation set. |
| Hardware Specification | Yes | The ViT-B and ViT-L models were trained on 128 V100 and A100 GPUs respectively. The larger models were trained on 256 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and the AdamW optimizer, but does not specify version numbers for these software dependencies, only citing papers for them. |
| Experiment Setup | Yes | We pre-train our models with the AdamW optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2017) with β2 = 0.95, as done by Zhai et al. (2023), to stabilize the pre-training. We use a learnable scale parameter a along with a learnable bias b for our objective, following the initialization of Zhai et al. (2023). Otherwise, all other training decisions closely follow those used by Radford et al. (2021) and Xu et al. (2023). For all of the Llip experiments, we fix M = 8, the number of heads in the cross-attention. Unless mentioned otherwise, the cross-attention's temperature is τ = 5. |
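As a concrete illustration of the setup quoted in the Experiment Setup row, the sketch below configures AdamW with β2 = 0.95, a learnable scale a and bias b initialized following Zhai et al. (2023), and a temperature-scaled cross-attention pooling over M = 8 visual mixture tokens. This is a minimal PyTorch sketch and not the authors' implementation; the module names, embedding dimension, learning rate, weight decay, and the exact placement of the temperature τ are assumptions.

```python
# Minimal sketch (not the authors' released code) of the quoted training setup:
# AdamW with beta2 = 0.95, a learnable scale `a` and bias `b` for the objective,
# and cross-attention pooling over M = 8 mixture tokens with temperature tau = 5.
# Names, dimensions, and the exact placement of tau are illustrative assumptions.
import math
import torch
import torch.nn as nn


class CrossAttentionPool(nn.Module):
    """Pools M visual mixture tokens into one text-conditioned embedding."""

    def __init__(self, dim: int, tau: float = 5.0):
        super().__init__()
        self.tau = tau
        self.q_proj = nn.Linear(dim, dim)  # projects the text embedding into a query
        self.k_proj = nn.Linear(dim, dim)  # projects the visual mixture tokens into keys

    def forward(self, text_emb: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D); visual_tokens: (B, M, D), with M = 8 here.
        q = self.q_proj(text_emb)
        k = self.k_proj(visual_tokens)
        # Temperature-scaled attention over the M mixture tokens (assumed form).
        logits = torch.einsum("bd,bmd->bm", q, k) / (self.tau * math.sqrt(q.shape[-1]))
        attn = logits.softmax(dim=-1)
        # Text-conditioned visual representation: weighted average of the tokens.
        return torch.einsum("bm,bmd->bd", attn, visual_tokens)


dim, M = 512, 8
pool = CrossAttentionPool(dim, tau=5.0)

# Learnable scale `a` and bias `b` for the contrastive objective,
# initialized as in Zhai et al. (2023): a = log(10), b = -10.
a = nn.Parameter(torch.tensor(math.log(10.0)))
b = nn.Parameter(torch.tensor(-10.0))

# AdamW with beta2 = 0.95 as quoted; the learning rate and weight decay
# below are placeholders, not values taken from the paper.
optimizer = torch.optim.AdamW(
    list(pool.parameters()) + [a, b],
    lr=1e-3,
    betas=(0.9, 0.95),
    weight_decay=0.2,
)

# Example forward pass with random tensors.
text_emb = torch.randn(4, dim)
visual_tokens = torch.randn(4, M, dim)
pooled = pool(text_emb, visual_tokens)  # shape: (4, dim)
```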