Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Authors: Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun (Violet) Peng

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental With MRAG-BENCH, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-BENCH is vision-centric.
Researcher Affiliation Academia UCLA; Stanford University
Pseudocode No The paper describes the methodology in narrative text and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper mentions 'https://mragbench.github.io' which is a project website for the MRAG-BENCH benchmark, but it does not explicitly state that the source code for the methodology described in the paper is provided there.
Open Datasets Yes MRAG-BENCH consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. The paper also lists the website 'https://mragbench.github.io'. Additionally, it states: 'To collect diverse image objects and knowledge that are not extensively represented in LVLMs memories (Zhang et al., 2024c), we considered three sources of data, ImageNet (Russakovsky et al., 2015), Oxford Flowers102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013).' and 'For the [OTHERS] scenario, we source the data from the GeoDE dataset (Ramaswamy et al., 2023).'
Dataset Splits No The paper presents MRAG-BENCH as an evaluation benchmark with 'Total questions 1,353' and 'Total number of images 16,130'. These are the overall benchmark statistics; the paper does not define training, validation, or test splits, since the entire benchmark is used to evaluate pre-trained models.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory configurations.
Software Dependencies No The paper mentions using 'GPT-3.5-turbo to extract the multiple choice answer' but does not specify any other software dependencies with version numbers, such as programming languages or libraries used for implementation.
Experiment Setup Yes We evaluate 14 popular LVLMs on MRAG-BENCH, including 4 proprietary models and 10 open-source models that can accept multi-image inputs. We adopt the default generation hyper-parameters selected by each model. A CLIP retriever is used consistently across all models. Both Retrieved RAG and GT RAG employ top-5 image examples (except for the incomplete scenario, where a single example is intuitively sufficient). For simplicity, all our experiments used five retrieved or ground-truth image examples.
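The "top-5 image examples" retrieval step described above can be sketched as a nearest-neighbor search over image embeddings by cosine similarity. The sketch below is an illustrative assumption, not the authors' implementation: `retrieve_top_k` and the toy vectors are hypothetical, and the embeddings stand in for vectors that would come from a CLIP image encoder.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, corpus_embs, k=5):
    """Return indices of the k corpus embeddings most similar to the query.

    query_emb: embedding of the query image (e.g. from a CLIP image encoder).
    corpus_embs: candidate image embeddings from the same encoder.
    """
    scored = [(cosine_similarity(query_emb, emb), i)
              for i, emb in enumerate(corpus_embs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-dimensional "embeddings" standing in for real CLIP vectors.
query = [1.0, 0.0, 0.0]
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
          [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]
print(retrieve_top_k(query, corpus, k=2))  # -> [2, 0]
```

With k=5, as in the benchmark's default setting, the retrieved images would then be prepended to the model's multi-image input as visual context.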