Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Authors: Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks, even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. |
| Researcher Affiliation | Collaboration | FAIR at Meta; Mila, Université de Montréal; New York University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code for the methodology described. |
| Open Datasets | Yes | Our models were trained on the Common Crawl data curated using the methodology presented in Xu et al. (2023). We use a dataset of 2.5B image-text pairs collected using the same parameters as those used in Xu et al. (2023). |
| Dataset Splits | Yes | We collect the embedding vectors of 5,000 randomly chosen samples from ImageNet's validation set. |
| Hardware Specification | Yes | The ViT-B and ViT-L models were trained on 128 V100 and A100 GPUs, respectively. The larger models were trained on 256 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions PyTorch and the AdamW optimizer, but does not specify version numbers for these software dependencies, only citing papers for them. |
| Experiment Setup | Yes | We pre-train our models with the AdamW optimizer (Kingma & Ba, 2017; Loshchilov & Hutter, 2017) with β2 = 0.95, as done by Zhai et al. (2023), to stabilize the pre-training. We use a learnable scale parameter a along with a learnable bias b for our objective, following the initialization of Zhai et al. (2023). Otherwise, all other training decisions closely follow those used by Radford et al. (2021); Xu et al. (2023). For all of the Llip experiments, we fix the number of cross-attention heads to M = 8. Unless mentioned otherwise, the cross-attention's temperature is τ = 5. |
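
For orientation, below is a minimal PyTorch sketch of the setup quoted in the Experiment Setup row: AdamW with β2 = 0.95, a learnable scale and bias for the objective, and a text-conditioned cross-attention pooling with M = 8 heads and temperature τ = 5. The module and parameter names, dimensions, learning rate, weight decay, how the temperature enters the softmax, and the scale/bias initialization values are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming the hyperparameters quoted above; names and shapes are hypothetical.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionPool(nn.Module):
    """Pools visual tokens conditioned on a text query (illustrative only)."""
    def __init__(self, dim: int, num_heads: int = 8, tau: float = 5.0):
        super().__init__()
        self.num_heads, self.tau = num_heads, tau
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, text_query, visual_tokens):
        # text_query: (B, dim); visual_tokens: (B, N, dim)
        B, N, D = visual_tokens.shape
        H, hd = self.num_heads, D // self.num_heads
        q = self.q_proj(text_query).view(B, H, 1, hd)
        k = self.k_proj(visual_tokens).view(B, N, H, hd).transpose(1, 2)
        v = self.v_proj(visual_tokens).view(B, N, H, hd).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.tau   # temperature tau = 5 (placement assumed)
        mix = F.softmax(attn, dim=-1) @ v             # (B, H, 1, hd)
        return mix.reshape(B, D)                      # contextualized image feature

model = CrossAttentionPool(dim=768, num_heads=8, tau=5.0)

# Learnable scale and bias for the objective; the init values below are an assumption
# based on the sigmoid loss of Zhai et al. (2023), which the row cites for initialization.
logit_scale = nn.Parameter(torch.tensor(math.log(10.0)))
logit_bias = nn.Parameter(torch.tensor(-10.0))

optimizer = torch.optim.AdamW(
    list(model.parameters()) + [logit_scale, logit_bias],
    lr=1e-3,               # illustrative; not reported in this row
    betas=(0.9, 0.95),     # beta2 = 0.95 as quoted from the paper
    weight_decay=0.2,      # illustrative; not reported in this row
)
```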