Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS
Authors: Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. ... As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. |
| Researcher Affiliation | Collaboration | 1 NVIDIA 2 University of Waterloo |
| Pseudocode | No | The paper describes methods like 'modality-aware hard negative mining' and 'continuous text-to-text retrieval fine-tuning' in Section 4.1 and 4.1.2 through narrative text, without presenting any structured pseudocode blocks or algorithms. |
| Open Source Code | Yes | We release the model weights at: https://huggingface.co/nvidia/MM-Embed. |
| Open Datasets | Yes | We evaluate models' universal multimodal retrieval capabilities using the M-BEIR dataset (Wei et al., 2023), which is constructed from 10 datasets with 16 diverse multimodal retrieval tasks across 4 domains, as listed in Appendix Table 10. https://huggingface.co/datasets/TIGER-Lab/M-BEIR |
| Dataset Splits | Yes | We train our models on the M-BEIR 1.1M training queries and evaluate their effectiveness on the 190K test queries. ... Appendix Table 10: M-BEIR dataset statistics, listing per-task query counts (Train / Dev / Test) and candidate counts (Train / Dev / Test). |
| Hardware Specification | Yes | All fine-tuning is conducted on 8 80GB A100 GPUs. ... The latency is measured using one thread on a Linux machine with a 2.2 GHz Intel Xeon Silver 4210 CPU and NVIDIA RTX A6000 GPUs, respectively. |
| Software Dependencies | No | We implement our training and inference using Tevatron (Gao et al., 2023). While Tevatron is mentioned, no specific version number for this or any other software dependency is provided. |
| Experiment Setup | Yes | For the LLaVA-Next backbone, we fine-tune the models for 2 epochs with a learning rate of 1e-4. ... We fine-tune the models with the batch size of 128 * 8 and 64 * 8 when using random and hard negatives, respectively. ... We set the maximum length for queries and documents to 128. During continuously fine-tuning on both M-BEIR and text-to-text retrieval training data, we set the maximum length for queries and documents to 128 and 512, respectively. |
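For context on the setup quoted above: bi-encoder retrievers like the one described are typically fine-tuned with an in-batch contrastive (InfoNCE) objective, where each query's paired document is the positive and the other documents in the batch serve as negatives. The sketch below is illustrative only; the function name, temperature value, and numpy implementation are assumptions, not taken from the paper.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for a bi-encoder retriever.

    q: (B, D) query embeddings; d: (B, D) document embeddings, where the
    i-th document is the positive for the i-th query and all other
    in-batch documents act as negatives.
    """
    # L2-normalize so the score matrix holds cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = (q @ d.T) / temperature  # (B, B); diagonal = positives
    # Row-wise log-softmax, with the usual max-subtraction for stability
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the positive (diagonal) entries
    return -np.mean(np.diag(log_probs))

# Example: aligned query/document pairs score a much lower loss
# than random, unrelated pairs.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
docs = rng.normal(size=(8, 16))
random_loss = info_nce_loss(queries, docs)
aligned_loss = info_nce_loss(queries, queries)
```

The large effective batch sizes quoted above (128 * 8 and 64 * 8 across 8 GPUs) matter precisely because this objective treats every other in-batch document as a negative, so bigger batches supply more (and harder) negatives per update.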