Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Authors: Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun (Violet) Peng
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With MRAG-BENCH, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-BENCH is vision-centric. |
| Researcher Affiliation | Academia | UCLA, Stanford University |
| Pseudocode | No | The paper describes the methodology in narrative text and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to 'https://mragbench.github.io', the project website for the MRAG-BENCH benchmark, but it does not explicitly state that the source code for the described methodology is provided there. |
| Open Datasets | Yes | MRAG-BENCH consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. The paper also lists the website 'https://mragbench.github.io'. Additionally, it states: 'To collect diverse image objects and knowledge that are not extensively represented in LVLMs memories (Zhang et al., 2024c), we considered three sources of data, ImageNet (Russakovsky et al., 2015), Oxford Flowers102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013).' and 'For the [OTHERS] scenario, we source the data from the GeoDE dataset (Ramaswamy et al., 2023).' |
| Dataset Splits | No | The paper presents MRAG-BENCH as an evaluation benchmark with 'Total questions 1,353' and 'Total number of images 16,130'. It does not define training, validation, or test splits; the entire benchmark is used to evaluate pre-trained models. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using 'GPT-3.5-turbo to extract the multiple choice answer' but does not specify any other software dependencies with version numbers, such as programming languages or libraries used for implementation. |
| Experiment Setup | Yes | We evaluate 14 popular LVLMs on MRAG-BENCH, including 4 proprietary models and 10 open-sourced models that can accept multi-image inputs. We adopt default generation hyper-parameters selected by each model. CLIP retriever is consistently used across all models. Both Retrieved RAG and GT RAG employ top-5 image examples (except for the incomplete scenario, where a single example is intuitively sufficient). For simplicity, all our experiments used five retrieved or ground-truth image examples. |
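The retrieval step referenced in the Experiment Setup row (a CLIP retriever returning top-5 image examples) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes query and corpus embeddings have already been produced by a CLIP encoder, and the function name, embedding dimension, and toy data are all invented for the example.

```python
import numpy as np

def retrieve_top_k(query_emb, corpus_embs, k=5):
    """Return indices of the k corpus embeddings most similar to the
    query under cosine similarity (highest similarity first)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]      # indices of the k largest

# Toy stand-in data: 10 random "image embeddings" and a query that is
# a slightly perturbed copy of item 3, so item 3 should rank first.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10, 512))
query = corpus[3] + 0.01 * rng.normal(size=512)
top5 = retrieve_top_k(query, corpus, k=5)
print(top5)  # five indices; the first is the nearest neighbor (3 here)
```

In the benchmark's GT RAG setting the retrieved set would simply be replaced by the ground-truth image examples, with the same top-5 budget.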