Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Authors: Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun (Violet) Peng

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental With MRAG-BENCH, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-BENCH is vision-centric.
Researcher Affiliation Academia UCLA; Stanford University
Pseudocode No The paper describes the methodology in narrative text and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper mentions 'https://mragbench.github.io' which is a project website for the MRAG-BENCH benchmark, but it does not explicitly state that the source code for the methodology described in the paper is provided there.
Open Datasets Yes MRAG-BENCH consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. The paper also lists the website 'https://mragbench.github.io'. Additionally, it states: 'To collect diverse image objects and knowledge that are not extensively represented in LVLMs memories (Zhang et al., 2024c), we considered three sources of data, ImageNet (Russakovsky et al., 2015), Oxford Flowers102 (Nilsback & Zisserman, 2008), and Stanford Cars (Krause et al., 2013).' and 'For the [OTHERS] scenario, we source the data from the GeoDE dataset (Ramaswamy et al., 2023).'
Dataset Splits No The paper presents MRAG-BENCH as an evaluation benchmark with 'Total questions 1,353' and 'Total number of images 16,130'. These are the overall benchmark statistics; the paper does not define training, validation, or test splits, since the entire benchmark is used to evaluate pre-trained models.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory configurations.
Software Dependencies No The paper mentions using 'GPT-3.5-turbo to extract the multiple choice answer' but does not specify any other software dependencies with version numbers, such as programming languages or libraries used for implementation.
Experiment Setup Yes We evaluate 14 popular LVLMs on MRAG-BENCH, including 4 proprietary models and 10 open-source models that can accept multi-image inputs. We adopt the default generation hyper-parameters selected by each model. A CLIP retriever is used consistently across all models. Both Retrieved RAG and GT RAG employ top-5 image examples (except for the incomplete scenario, where a single example is intuitively sufficient). For simplicity, all our experiments used five retrieved or ground-truth image examples.
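The "top-5 image examples" retrieval step described above can be sketched as a nearest-neighbor search over image embeddings by cosine similarity. The sketch below is an illustrative assumption, not the authors' implementation: `retrieve_top_k` and the toy vectors are hypothetical, and the embeddings stand in for vectors that would come from a CLIP image encoder.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, corpus_embs, k=5):
    """Return indices of the k corpus embeddings most similar to the query.

    query_emb: embedding of the query image (e.g. from a CLIP image encoder).
    corpus_embs: candidate image embeddings from the same encoder.
    """
    scored = [(cosine_similarity(query_emb, emb), i)
              for i, emb in enumerate(corpus_embs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-dimensional "embeddings" standing in for real CLIP vectors.
query = [1.0, 0.0, 0.0]
corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
          [1.0, 0.05, 0.0], [0.0, 0.0, 1.0]]
print(retrieve_top_k(query, corpus, k=2))  # -> [2, 0]
```

With k=5, as in the benchmark's default setting, the retrieved images would then be prepended to the model's multi-image input as visual context.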