Multilingual Diversity Improves Vision-Language Representations

Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet, ImageNet distribution shifts, image-English-text retrieval, and on average across 38 tasks from the DataComp benchmark.
Researcher Affiliation | Collaboration | 1University of Washington, 2Allen Institute for Artificial Intelligence
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We did not open-source the translation code, but we describe in detail in Section 3 how we apply translation to all samples in the pool and re-filter them. (A hedged sketch of this translate-then-refilter step appears below.)
Open Datasets | Yes | We experiment with the medium pool of the DataComp benchmark [14], which consists of 128M image-text pairs randomly sampled from Common Crawl dumps between 2014 and 2022, and deduplicated. We will release the raw captions and the corresponding English translations for the 128M image-text pairs used in our experiments.
Dataset Splits | No | The paper describes pre-training on the DataComp pool and then evaluating on 38 separate tasks and benchmarks, but it does not specify a train/validation/test split of its primary 128M image-text dataset for its own experiments.
Hardware Specification | Yes | Each of our baselines takes about 8 hours with 8 A40 GPUs and 40 CPUs.
Software Dependencies | No | The paper mentions using CLIP and specific optimizers and hyperparameters, but it does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | To summarize, we use ViT-B/32 as the image encoder for CLIP, and fix the hyperparameters used for training: learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW optimizer with β2 = 0.98. (A hedged configuration sketch appears below.)
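The Open Source Code entry above only states that every caption in the pool is translated to English and that the pool is then re-filtered. The sketch below is an illustration of that pipeline shape, not the authors' released code: the translation model is left as a stub, and the re-filtering rule (CLIP-score thresholding with open_clip) plus all function names are assumptions introduced here.

```python
# Hedged sketch of the translate-then-refilter step described in the report above.
# Only "translate all captions to English, then re-filter the pool" comes from the
# excerpt; the CLIP-score filtering rule and every name below are assumptions.

import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def translate_to_english(caption: str) -> str:
    """Hypothetical stand-in for the paper's caption-translation step."""
    raise NotImplementedError("plug in a machine-translation model here")

@torch.no_grad()
def clip_score(image, caption: str) -> float:
    """Cosine similarity between an image and its (translated) caption."""
    img = preprocess(image).unsqueeze(0).to(device)
    txt = tokenizer([caption]).to(device)
    img_feat = torch.nn.functional.normalize(model.encode_image(img), dim=-1)
    txt_feat = torch.nn.functional.normalize(model.encode_text(txt), dim=-1)
    return (img_feat @ txt_feat.T).item()

def refilter(pool, threshold: float):
    """Keep pairs whose translated caption still scores above a threshold (assumed rule)."""
    for image, raw_caption in pool:
        english = translate_to_english(raw_caption)
        if clip_score(image, english) >= threshold:
            yield image, english
```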
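For orientation, the reported training hyperparameters can be assembled into a short configuration sketch. Only the values quoted in the Experiment Setup row (ViT-B/32 image encoder, learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW with β2 = 0.98) come from the paper; β1, the weight decay, the cosine decay after warmup, the total step count, and the use of the open_clip model factory are assumptions, not details taken from the excerpt.

```python
# Minimal sketch of the reported CLIP pre-training configuration (assumptions noted inline).
import math
import torch
import open_clip

# ViT-B/32 image encoder for CLIP, per the paper; open_clip factory is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")

LR = 5e-4            # reported
WARMUP_STEPS = 500   # reported
BATCH_SIZE = 4096    # reported
TOTAL_STEPS = 128_000_000 // BATCH_SIZE  # assumed: ~128M examples seen, as in DataComp medium

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LR,
    betas=(0.9, 0.98),  # beta2 = 0.98 per the paper; beta1 = 0.9 assumed (PyTorch default)
    weight_decay=0.2,   # assumed; a common CLIP pre-training default
)

def lr_at(step: int) -> float:
    """Linear warmup for 500 steps, then cosine decay (decay schedule assumed)."""
    if step < WARMUP_STEPS:
        return LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * LR * (1 + math.cos(math.pi * progress))
```

In a training loop, `lr_at(step)` would be written into each optimizer param group before every update; the open_clip reference training code uses a similar warmup-plus-cosine pattern, though the excerpt does not confirm which scheduler the authors used.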