Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS

Authors: Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark.
Researcher Affiliation Collaboration 1 NVIDIA 2 University of Waterloo
Pseudocode No The paper describes methods like 'modality-aware hard negative mining' and 'continuous text-to-text retrieval fine-tuning' in Section 4.1 and 4.1.2 through narrative text, without presenting any structured pseudocode blocks or algorithms.
Open Source Code Yes We release the model weights at: https://huggingface.co/nvidia/MM-Embed.
Open Datasets Yes We evaluate models' universal multimodal retrieval capabilities using the M-BEIR dataset (Wei et al., 2023), which is constructed from 10 datasets with 16 diverse multimodal retrieval tasks across 4 domains, as listed in Appendix Table 10. https://huggingface.co/datasets/TIGER-Lab/M-BEIR
Dataset Splits Yes We train our models on the M-BEIR 1.1M training queries and evaluate their effectiveness on the 190K test queries. ... Appendix Table 10 (M-BEIR dataset statistics) reports the number of queries and candidates per task, dataset, and domain across the Train, Dev, and Test splits.
Hardware Specification Yes All fine-tuning is conducted on 8 80GB A100 GPUs. ... The latency is measured using one thread on a Linux machine with a 2.2 GHz Intel Xeon Silver 4210 CPU and NVIDIA RTX A6000 GPUs, respectively.
Software Dependencies No We implement our training and inference using Tevatron (Gao et al., 2023). While Tevatron is mentioned, no specific version number for this or any other software dependency is provided.
Experiment Setup Yes For the LLaVA-Next backbone, we fine-tune the models for 2 epochs with a learning rate of 1e-4. ... We fine-tune the models with batch sizes of 128 * 8 and 64 * 8 when using random and hard negatives, respectively. ... We set the maximum length for queries and documents to 128. During continuous fine-tuning on both M-BEIR and text-to-text retrieval training data, we set the maximum length for queries and documents to 128 and 512, respectively.
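The bi-encoder retriever described in the Research Type row embeds queries and candidates independently and ranks candidates by embedding similarity. The sketch below illustrates only that scoring pattern with a toy deterministic hash-based encoder; the `embed` and `retrieve` functions are hypothetical stand-ins, not the authors' MLLM or CLIP encoders.

```python
import hashlib
import numpy as np

def embed(texts, dim=64):
    # Toy stand-in encoder: each token is mapped to a fixed random vector
    # (seeded by an MD5 hash of the token), and a text's embedding is the
    # unit-normalized sum of its token vectors. In MM-Embed this role is
    # played by the fine-tuned multimodal LLM (or CLIP) bi-encoder.
    vecs = []
    for t in texts:
        v = np.zeros(dim)
        for tok in t.lower().split():
            seed = int.from_bytes(hashlib.md5(tok.encode()).digest()[:4], "little")
            v += np.random.default_rng(seed).normal(size=dim)
        n = np.linalg.norm(v)
        vecs.append(v / n if n > 0 else v)
    return np.stack(vecs)

def retrieve(query, candidates):
    # Bi-encoder scoring: encode query and candidates separately, then
    # rank by cosine similarity (dot product of unit-norm embeddings).
    q = embed([query])              # shape (1, dim)
    c = embed(candidates)           # shape (n, dim)
    scores = (q @ c.T).ravel()
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]
```

Because the two sides are encoded independently, candidate embeddings can be precomputed and indexed offline, which is what makes the bi-encoder design practical for large-scale retrieval.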
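The hyperparameters quoted in the Experiment Setup and Hardware rows can be collected into a single configuration object. The numeric values below are taken from the paper's quoted setup; the class and field names are illustrative and not from the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    # Values as quoted from the paper; names are hypothetical.
    learning_rate: float = 1e-4
    epochs: int = 2
    per_gpu_batch_random_neg: int = 128  # per-GPU batch with random negatives
    per_gpu_batch_hard_neg: int = 64     # per-GPU batch with hard negatives
    num_gpus: int = 8                    # 8x 80GB A100
    max_query_len: int = 128
    max_doc_len: int = 128               # 512 for the text-to-text document side

    def effective_batch(self, hard_negatives: bool) -> int:
        # "128 * 8" and "64 * 8" in the paper are per-GPU batch x GPU count.
        per_gpu = (self.per_gpu_batch_hard_neg if hard_negatives
                   else self.per_gpu_batch_random_neg)
        return per_gpu * self.num_gpus
```

For example, `FinetuneConfig().effective_batch(hard_negatives=True)` gives the 64 * 8 = 512 global batch used with modality-aware hard negatives.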