Multilingual Diversity Improves Vision-Language Representations

Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet, ImageNet distribution shifts, image-English-text retrieval, and on average across 38 tasks from the DataComp benchmark.
Researcher Affiliation | Collaboration | 1University of Washington, 2Allen Institute for Artificial Intelligence
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We did not open-source the translation code, but we describe in detail in Section 3 how we apply translation to all samples in the pool and re-filter them. (A hedged sketch of this translate-then-refilter step appears below.)
Open Datasets | Yes | We experiment with the medium pool of the DataComp benchmark [14], which consists of 128M image-text pairs randomly sampled from Common Crawl dumps between 2014 and 2022, and deduplicated. We will release the raw captions and the corresponding English translations for the 128M image-text pairs used in our experiments.
Dataset Splits | No | The paper describes pre-training on the DataComp pool and then evaluating on 38 separate tasks and benchmarks, but it does not specify a train/validation/test split of its primary 128M image-text dataset for its own experiments.
Hardware Specification | Yes | Each of our baselines takes about 8 hours with 8 A40 GPUs and 40 CPUs.
Software Dependencies | No | The paper mentions using CLIP and specific optimizers and hyperparameters, but it does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | To summarize, we use ViT-B/32 as the image encoder for CLIP, and fix the hyperparameters used for training: learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW optimizer with β2 = 0.98. (A hedged configuration sketch appears below.)
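The Open Source Code entry above only states that every caption in the pool is translated to English and that the pool is then re-filtered. The sketch below is an illustration of that pipeline shape, not the authors' released code: the translation model is left as a stub, and the re-filtering rule (CLIP-score thresholding with open_clip) plus all function names are assumptions introduced here.

```python
# Hedged sketch of the translate-then-refilter step described in the report above.
# Only "translate all captions to English, then re-filter the pool" comes from the
# excerpt; the CLIP-score filtering rule and every name below are assumptions.

import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def translate_to_english(caption: str) -> str:
    """Hypothetical stand-in for the paper's caption-translation step."""
    raise NotImplementedError("plug in a machine-translation model here")

@torch.no_grad()
def clip_score(image, caption: str) -> float:
    """Cosine similarity between an image and its (translated) caption."""
    img = preprocess(image).unsqueeze(0).to(device)
    txt = tokenizer([caption]).to(device)
    img_feat = torch.nn.functional.normalize(model.encode_image(img), dim=-1)
    txt_feat = torch.nn.functional.normalize(model.encode_text(txt), dim=-1)
    return (img_feat @ txt_feat.T).item()

def refilter(pool, threshold: float):
    """Keep pairs whose translated caption still scores above a threshold (assumed rule)."""
    for image, raw_caption in pool:
        english = translate_to_english(raw_caption)
        if clip_score(image, english) >= threshold:
            yield image, english
```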
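For orientation, the reported training hyperparameters can be assembled into a short configuration sketch. Only the values quoted in the Experiment Setup row (ViT-B/32 image encoder, learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW with β2 = 0.98) come from the paper; β1, the weight decay, the cosine decay after warmup, the total step count, and the use of the open_clip model factory are assumptions, not details taken from the excerpt.

```python
# Minimal sketch of the reported CLIP pre-training configuration (assumptions noted inline).
import math
import torch
import open_clip

# ViT-B/32 image encoder for CLIP, per the paper; open_clip factory is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")

LR = 5e-4            # reported
WARMUP_STEPS = 500   # reported
BATCH_SIZE = 4096    # reported
TOTAL_STEPS = 128_000_000 // BATCH_SIZE  # assumed: ~128M examples seen, as in DataComp medium

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LR,
    betas=(0.9, 0.98),  # beta2 = 0.98 per the paper; beta1 = 0.9 assumed (PyTorch default)
    weight_decay=0.2,   # assumed; a common CLIP pre-training default
)

def lr_at(step: int) -> float:
    """Linear warmup for 500 steps, then cosine decay (decay schedule assumed)."""
    if step < WARMUP_STEPS:
        return LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * LR * (1 + math.cos(math.pi * progress))
```

In a training loop, `lr_at(step)` would be written into each optimizer param group before every update; the open_clip reference training code uses a similar warmup-plus-cosine pattern, though the excerpt does not confirm which scheduler the authors used.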