Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Multilingual Diversity Improves Vision-Language Representations
Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei W. Koh, Ranjay Krishna
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet, ImageNet distribution shifts, image-English-text retrieval and on average across 38 tasks from the DataComp benchmark. |
| Researcher Affiliation | Collaboration | 1University of Washington 2Allen Institute for Artificial Intelligence |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We did not open-source the translation code, but we describe in detail in Section 3 how we apply translation to all samples in the pool and re-filter them. |
| Open Datasets | Yes | We experiment with the medium pool of the DataComp benchmark [14], which consists of 128M image-text pairs randomly sampled from Common Crawl dumps between 2014 and 2022, and deduplicated. We will release the raw captions and the corresponding English translations for the 128M image-text pairs used in our experiments. |
| Dataset Splits | No | The paper describes using the DataComp benchmark for pre-training and then evaluating on 38 separate tasks and benchmarks, but it does not specify a train/validation/test split of its primary 128M image-text dataset for its own experiments. |
| Hardware Specification | Yes | each of our baseline takes about 8 hours with 8 A40 GPUs and 40 CPUs. |
| Software Dependencies | No | The paper mentions using CLIP and specific optimizers and hyperparameters, but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | To summarize, we use ViT-B/32 as the image encoder for CLIP, and fix the hyperparameters used for training: learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW optimizer β2 = 0.98. |
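
The values quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `ClipTrainingConfig`, `warmup_lr`, and `gpu_hours` are hypothetical, and the linear shape of the warmup is an assumption, since the table states only the number of warmup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipTrainingConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    image_encoder: str = "ViT-B/32"
    learning_rate: float = 5e-4
    warmup_steps: int = 500
    batch_size: int = 4096
    optimizer: str = "AdamW"
    adam_beta2: float = 0.98


def warmup_lr(step: int, cfg: ClipTrainingConfig) -> float:
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTION: the warmup is linear. The table gives only the step count,
    and any decay schedule after warmup is not stated, so none is modeled.
    """
    if step >= cfg.warmup_steps:
        return cfg.learning_rate
    return cfg.learning_rate * (step + 1) / cfg.warmup_steps


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total accelerator time: wall-clock hours multiplied by GPU count."""
    return wall_clock_hours * num_gpus


cfg = ClipTrainingConfig()

# Hardware Specification row: about 8 hours on 8 A40 GPUs per baseline.
baseline_cost = gpu_hours(wall_clock_hours=8, num_gpus=8)  # 64 A40 GPU-hours
```

Under these figures, one baseline run costs roughly 64 A40 GPU-hours. Software versions are not reported (see the Software Dependencies row), so an exact rerun would still require choosing framework and CUDA versions independently.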