Multilingual Diversity Improves Vision-Language Representations
Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei W. Koh, Ranjay Krishna
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks. Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet, ImageNet distribution shifts, image-English-text retrieval and on average across 38 tasks from the DataComp benchmark. |
| Researcher Affiliation | Collaboration | 1 University of Washington, 2 Allen Institute for Artificial Intelligence |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We did not open-source the translation code, but we describe in detail in Section 3 how we apply translation to all samples in the pool and re-filter them. |
| Open Datasets | Yes | We experiment with the medium pool of the DataComp benchmark [14], which consists of 128M image-text pairs randomly sampled from Common Crawl dumps between 2014 and 2022, and deduplicated. We will release the raw captions and the corresponding English translations for the 128M image-text pairs used in our experiments. |
| Dataset Splits | No | The paper describes using the DataComp benchmark for pre-training and then evaluating on 38 separate tasks and benchmarks, but it does not specify a train/validation/test split of its primary 128M image-text dataset for its own experiments. |
| Hardware Specification | Yes | each of our baselines takes about 8 hours with 8 A40 GPUs and 40 CPUs. |
| Software Dependencies | No | The paper mentions using CLIP and specific optimizers and hyperparameters, but it does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes (see the sketch below) | To summarize, we use ViT-B/32 as the image encoder for CLIP, and fix the hyperparameters used for training: learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW optimizer β2 = 0.98. |
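
The Experiment Setup row quotes a concrete configuration (ViT-B/32 CLIP, learning rate 5e-4, 500 warmup steps, batch size 4096, AdamW with β2 = 0.98). Below is a minimal sketch of that configuration, assuming the open_clip library (the stack DataComp builds on); the post-warmup schedule and β1 value are assumptions, not stated in the quoted text, and this is not the authors' released training code.

```python
# Minimal sketch of the reported CLIP pre-training configuration.
# Assumption: open_clip is used (as in the DataComp training stack);
# the paper's actual pipeline may differ.
import torch
import open_clip

# ViT-B/32 image encoder, as stated in the Experiment Setup row.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hyperparameters quoted from the paper: lr 5e-4, AdamW with beta2 = 0.98,
# batch size 4096, 500 warmup steps. beta1 = 0.9 is an assumed default.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
batch_size = 4096
warmup_steps = 500

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 500 steps; a constant rate afterwards is
    # an assumption here (DataComp-style runs typically use cosine decay).
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Standard CLIP contrastive objective from open_clip.
loss_fn = open_clip.loss.ClipLoss()
```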