Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Meta CLIP 2: A Worldwide Scaling Recipe

Authors: Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, Jim Glass, Lifei Huang, Jason E Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Scott Yih, Shang-Wen Li, Hu Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We first present the main ablations of Meta CLIP 2 on a wide range of English and multilingual zero-shot transfer benchmarks, along with other multilingual CLIP baselines for comparison (Sec. 4.2.1); then we conduct a comprehensive ablation study on the variants of metadata, curation and tokenizer (Sec. 4.2.2). Lastly, we evaluate the embedding quality of Meta CLIP 2 on downstream tasks for culture diversity (Sec. 4.3) and building MLLM (Sec. 4.4). Additionally, we conduct analysis on embedding alignment and uniformity [41] (Sec. 4.5) and distillation (Sec. 4.6).
Researcher Affiliation Collaboration Yung-Sung Chuang1,2, Yang Li1, Dong Wang1, Ching-Feng Yeh1, Kehan Lyu1, Ramya Raghavendra1, James Glass2, Lifei Huang1, Jason Weston1, Luke Zettlemoyer1, Xinlei Chen1, Zhuang Liu3, Saining Xie4, Wen-tau Yih1, Shang-Wen Li1, Hu Xu1; 1 FAIR, Meta; 2 MIT; 3 Princeton University; 4 New York University
Pseudocode Yes Algorithm 1: Pseudo-code of Meta CLIP 2 Curation Algorithm in Python/NumPy.
Open Source Code Yes Code and model are available at https://github.com/facebookresearch/MetaCLIP.
Open Datasets Yes Following Meta CLIP pipeline, we collect 5B image-text pairs sourced from the Internet that are publicly available. (...) Meta CLIP Code (License: CC-BY-NC; URL: https://github.com/facebookresearch/MetaCLIP); Wikipedia Dumps (License: CC BY-SA 4.0; URL: https://dumps.wikimedia.org/); WordNet (License: WordNet License 3.0; URL: https://wordnet.princeton.edu/); Multilingual WordNet (License: WordNet License 3.0; URL: https://omwn.org/); CLIP benchmark, including Flickr30k-200, XTD-10, XTD-200 (License: MIT; URL: https://github.com/LAION-AI/CLIP_benchmark)
Dataset Splits Yes 1) English-only benchmarks on ImageNet (IN val) [42], SLIP 26 tasks (SLIP 26 avg.) [35], and DataComp 37 tasks (DC 37 avg.) [28]; 2) multilingual benchmarks on Babel-ImageNet (Babel-IN) [43] (averaged zero-shot classification on IN with classes and prompts translated into 280 languages), XM3600 [44] (multilingual text-to-image, T→I, and image-to-text, I→T, retrieval with an averaged recall@1 on 36 languages), CVQA [45] (multilingual multi-choice visual question answering with English and local averaged answer accuracy), Flickr30k-200 [46] (Flickr30k test set translated into 200 languages), XTD-10 [47] (multilingual image-text retrieval on MSCOCO [48] averaged Recall@1 over 7 languages), and XTD-200 [46] (XTD-10 translated into 200 languages). In Table 1, we observe that Meta CLIP 2 on ViT-H/14 with worldwide data and scaled seen pairs consistently outperforms its counterparts English (1.0×) and Non-English (1.3×), on both English and multilingual tasks, effectively breaking the "curse of multilinguality". The curse still exists in non-scaled seen pairs, Worldwide (1.0×), or smaller ViT-L/14 model even with Worldwide (2.3×). We further provide gradient conflict analysis to help understand the root of the curse in Appendix C.
Hardware Specification No No specific hardware details like GPU model, CPU model, or memory are provided. The paper mentions "larger models that take weeks to train, i.e. ViT-L/14 and ViT-H/14" but this is not a hardware specification.
Software Dependencies No Algorithm 1: Pseudo-code of Meta CLIP 2 Curation Algorithm in Python/NumPy. (...) We clean the corpora into plain text with WikiExtractor [39]. (...) For languages without space separation (e.g., some Asian languages), we use open-source tokenizers (see Table 7 in Appendix A.1) developed by local communities to properly split text into words and meanwhile maintain the semantic integrity. (...) We adopt the Aho-Corasick algorithm 6,7, which utilizes prefix trees (tries), for rapid substring matching. The matching speed is about 2k times faster than Meta CLIP's brute-force implementation, enabling matching with million-scale metadata.
Experiment Setup Yes Table 8: Hyperparameters of OpenAI CLIP / Meta CLIP vs Meta CLIP 2. Hyperparameter | OpenAI CLIP / Meta CLIP | Meta CLIP 2; Activation Function | QuickGELU | QuickGELU; Seen Pairs | 12.8B | 29B (2.3×); Batch Size | 32768 | 75366 (2.3×); Learning Rate | 4.0e-4 (L/14, H/14) | 4.0e-4 (H/14); Warm-up | 2k | 2k
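The Software Dependencies response quotes the paper's use of the Aho-Corasick algorithm, which builds a prefix tree (trie) over all metadata entries so that a caption can be scanned for every entry in a single pass. The following is a minimal pure-Python sketch of that general technique, not the authors' implementation (the paper appears to rely on an existing library, which is not named in the quoted text). The function names `build_automaton` and `find_matches` and the sample metadata entries are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links over `patterns` (illustrative sketch)."""
    goto = [{}]   # goto[node][char] -> child node id
    fail = [0]    # fail[node] -> node for the longest proper suffix also in the trie
    out = [[]]    # out[node] -> patterns that end at this node

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)

    # Phase 2: breadth-first pass to set failure links and merge outputs.
    queue = deque(goto[0].values())  # depth-1 nodes keep fail = 0 (the root)
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix node.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in `text`."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on `ch` exists or we hit the root.
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# Hypothetical metadata entries and caption, for illustration only.
metadata = ["cat", "catalog", "dog"]
automaton = build_automaton(metadata)
print(find_matches("a dog reads a catalog", automaton))
```

Each character of the caption is consumed once, so the scan cost depends on caption length and the number of matches rather than on the number of metadata entries. That property is what makes matching against very large metadata feasible, in contrast to checking each entry separately.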
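The 2.3× factors reported in Table 8 can be sanity-checked with a few lines of arithmetic. The sketch below uses only the values quoted in the table. The helper names are hypothetical, and the step count assumes the usual relation of steps = seen pairs / batch size, which the quoted text does not state explicitly.

```python
# Values copied from Table 8 as quoted in the Experiment Setup row.
BASE_SEEN_PAIRS = 12.8e9     # OpenAI CLIP / Meta CLIP
BASE_BATCH_SIZE = 32768
SCALED_SEEN_PAIRS = 29e9     # Meta CLIP 2
SCALED_BATCH_SIZE = 75366
SCALE = 2.3


def scaled(value, factor=SCALE):
    """Apply the reported scaling factor to a baseline value."""
    return value * factor


def steps(seen_pairs, batch_size):
    """Assumed relation: optimizer steps = seen pairs / global batch size."""
    return seen_pairs / batch_size


print("batch size x 2.3:", round(scaled(BASE_BATCH_SIZE)))
print("seen pairs x 2.3 (billions):", scaled(BASE_SEEN_PAIRS) / 1e9)
print("baseline steps:", steps(BASE_SEEN_PAIRS, BASE_BATCH_SIZE))
print("scaled steps:", steps(SCALED_SEEN_PAIRS, SCALED_BATCH_SIZE))
```

Under these assumptions the scaled batch size rounds to the reported 75366, and 2.3 times the baseline seen pairs lands slightly above the reported 29B, consistent with the table showing a rounded figure. The two step counts come out close to each other, which suggests the training length in steps stays roughly the same while each step sees more pairs.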