Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Authors: Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a series of VLM2VEC models based on state-of-the-art VLMs, including Phi-3.5-V, LLaVA-1.6, and Qwen2-VL, and evaluate them on MMEB's benchmark. With LoRA tuning, VLM2VEC achieves a 10% to 20% improvement over existing multimodal embedding models on MMEB's evaluation sets. Our findings reveal that VLMs are secretly strong embedding models. |
| Researcher Affiliation | Collaboration | ¹University of Waterloo, ²Salesforce Research |
| Pseudocode | No | The paper describes the VLM2VEC framework and contrastive training mathematically in Section 3.1, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://tiger-ai-lab.github.io/VLM2Vec/), but it does not explicitly state that source code is released there, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We present MMEB (Massive Multimodal Embedding Benchmark), a comprehensive benchmark designed to evaluate multimodal embeddings across a diverse set of tasks. MMEB consists of 36 datasets organized into four meta-tasks: classification, visual question answering, retrieval, and visual grounding. Each task is reformulated as a ranking problem... Examples for each dataset in MMEB are provided in Tables 7, 8, 9 and 10. The diversity in MMEB makes it an ideal testbed for universal embeddings. Further details on dataset processing can be found in Section A.1. |
| Dataset Splits | Yes | MMEB is divided into 20 in-distribution datasets, which can be used for training, and 16 out-of-distribution datasets, reserved for evaluation. ... For the number of target candidates, a higher count could increase evaluation costs and hinder rapid model iteration, while a lower count might make the benchmark too simple and prone to saturation. To strike a balance between these extremes, we have chosen 1,000 candidates. ... For the 20 training datasets, we randomly select up to 100K data points. |
| Hardware Specification | Yes | All experiments were run on 8 H100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and techniques used, such as Phi-3.5-V, LLaVA-1.6, Qwen2-VL, LoRA, and GradCache, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The temperature for the loss function is set to 0.02, with a batch size of 1,024, a maximum text length of 256 tokens, and 2K training steps. The LoRA variant uses a rank of 8. For VLM2VEC leveraging Phi-3.5-V as the backbone, we configure the number of sub-image crops to 4. For VLM2VEC using LLaVA-1.6 and Qwen2-VL as the backbone, we resize the input images to a uniform resolution, employing two setups: a high-resolution configuration of 1344 × 1344 and a low-resolution configuration of 336 × 336. |
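
The table reports contrastive training with a loss temperature of 0.02 but includes no pseudocode. The following is a minimal sketch of how such a temperature-scaled contrastive loss might look, assuming a standard InfoNCE objective with in-batch negatives and cosine similarity. The names `SETUP`, `cosine`, and `info_nce_loss` are hypothetical and are not taken from the paper or its code; only the numeric values come from the Experiment Setup row.

```python
import math

# Hyperparameters quoted in the Experiment Setup row above.
# The dictionary keys are illustrative names, not the authors' config keys.
SETUP = {
    "temperature": 0.02,
    "batch_size": 1024,
    "max_text_length": 256,
    "training_steps": 2000,  # "2K training steps"
    "lora_rank": 8,
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=SETUP["temperature"]):
    """Mean InfoNCE loss over a batch (assumed form, not the paper's code).

    targets[i] is the positive for queries[i]; every other target in the
    batch is treated as a negative.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(logit - peak) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce_loss(queries, aligned))
    print("swapped loss:", info_nce_loss(queries, swapped))
```

With a temperature as low as 0.02, a cosine similarity gap of 1.0 becomes a logit gap of 50, so the loss is close to zero when each query matches its own target and large when the pairs are swapped. A real training run would compute embeddings with the VLM backbone and use a tensor library rather than Python lists.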