Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Authors: Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. |
| Researcher Affiliation | Collaboration | FAIR at Meta; Mila, Université de Montréal; New York University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code for the methodology described. |
| Open Datasets | Yes | Our models were trained on the Common Crawl data curated using the methodology presented in Xu et al. (2023). We use a dataset of 2.5B image-text pairs collected using the same parameters that were used in Xu et al. (2023). |
| Dataset Splits | Yes | We collect the embedding vectors of 5000 randomly chosen samples from ImageNet's validation set. |
| Hardware Specification | Yes | The ViT-B and ViT-L models were trained on 128 V100 and A100 GPUs respectively. The larger models were trained on 256 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and the AdamW optimizer, but does not specify version numbers for these software dependencies, only citing papers for them. |
| Experiment Setup | Yes | We pre-train our models with the AdamW optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2017) with β2 = 0.95, as done by Zhai et al. (2023), to stabilize the pre-training. We use a learnable scale parameter a along with a learnable bias b for our objective, following the initialization of Zhai et al. (2023). Otherwise, all other training decisions closely follow those used by Radford et al. (2021) and Xu et al. (2023). For all of the Llip experiments, we fix M = 8, the number of heads in the cross-attention. Unless mentioned otherwise, the cross-attention's temperature is τ = 5. |
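As a concrete illustration of the setup quoted in the Experiment Setup row, the sketch below configures AdamW with β2 = 0.95, a learnable scale a and bias b initialized following Zhai et al. (2023), and a temperature-scaled cross-attention pooling over M = 8 visual mixture tokens. This is a minimal PyTorch sketch and not the authors' implementation; the module names, embedding dimension, learning rate, weight decay, and the exact placement of the temperature τ are assumptions.

```python
# Minimal sketch (not the authors' released code) of the quoted training setup:
# AdamW with beta2 = 0.95, a learnable scale `a` and bias `b` for the objective,
# and cross-attention pooling over M = 8 mixture tokens with temperature tau = 5.
# Names, dimensions, and the exact placement of tau are illustrative assumptions.
import math
import torch
import torch.nn as nn


class CrossAttentionPool(nn.Module):
    """Pools M visual mixture tokens into one text-conditioned embedding."""

    def __init__(self, dim: int, tau: float = 5.0):
        super().__init__()
        self.tau = tau
        self.q_proj = nn.Linear(dim, dim)  # projects the text embedding into a query
        self.k_proj = nn.Linear(dim, dim)  # projects the visual mixture tokens into keys

    def forward(self, text_emb: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D); visual_tokens: (B, M, D), with M = 8 here.
        q = self.q_proj(text_emb)
        k = self.k_proj(visual_tokens)
        # Temperature-scaled attention over the M mixture tokens (assumed form).
        logits = torch.einsum("bd,bmd->bm", q, k) / (self.tau * math.sqrt(q.shape[-1]))
        attn = logits.softmax(dim=-1)
        # Text-conditioned visual representation: weighted average of the tokens.
        return torch.einsum("bm,bmd->bd", attn, visual_tokens)


dim, M = 512, 8
pool = CrossAttentionPool(dim, tau=5.0)

# Learnable scale `a` and bias `b` for the contrastive objective,
# initialized as in Zhai et al. (2023): a = log(10), b = -10.
a = nn.Parameter(torch.tensor(math.log(10.0)))
b = nn.Parameter(torch.tensor(-10.0))

# AdamW with beta2 = 0.95 as quoted; the learning rate and weight decay
# below are placeholders, not values taken from the paper.
optimizer = torch.optim.AdamW(
    list(pool.parameters()) + [a, b],
    lr=1e-3,
    betas=(0.9, 0.95),
    weight_decay=0.2,
)

# Example forward pass with random tensors.
text_emb = torch.randn(4, dim)
visual_tokens = torch.randn(4, M, dim)
pooled = pool(text_emb, visual_tokens)  # shape: (4, dim)
```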