Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

ICLR 2025

Reproducibility Variables
Research Type: Experimental
LLM Response: To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. Our evaluation of nearly 30 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We conduct ablation studies to investigate the impact of unanswerable questions and multi-image concatenation methods on model performance.
Researcher Affiliation: Collaboration
LLM Response: 1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 The University of Hong Kong, 4 SenseTime Research, 5 Tsinghua University
Pseudocode: No
LLM Response: The paper describes methods and processes in text but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes
LLM Response: We release the data and code at https://github.com/OpenGVLab/MMIU. First, we introduce and open-source the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite that addresses various complex multi-image tasks, thereby filling a critical gap in multi-image comprehension.
Open Datasets: Yes
LLM Response: To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. We release the data and code at https://github.com/OpenGVLab/MMIU. Table 5: task descriptions and corresponding datasets for multi-image tasks in temporal relationships, including Kinetics (Kay et al., 2017), MSVD (Chen & Dolan, 2011), and MSRVTT (Xu et al., 2016). Table 6: task descriptions and corresponding datasets for multi-image tasks in spatial relationships, including MSCOCO (Lin et al., 2014) and ScanNet (Dai et al., 2017). Table 7: task descriptions and corresponding datasets for multi-image tasks in semantic relationships, including Places365 (Zhou et al., 2017) and LFW (Huang et al., 2008).
Dataset Splits: Yes
LLM Response: We provide two versions of MMIU: test and testmini, the latter being one-tenth the size of the former (1,040 samples) for quick testing. In this paper, we primarily conduct experiments and analyses on the test set, while results on testmini are recorded in Table 15 in the appendix for comparison. For efficient evaluation, we limit each task to a maximum of 200 randomly selected samples, except for tasks with insufficient data.
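The sampling protocol described above (a 200-sample cap per task, plus a one-tenth-size testmini subset for quick testing) can be sketched as follows. This is a minimal illustration, not code from the MMIU repository; the function names, seeding scheme, and data layout are assumptions.

```python
import random

def cap_task_samples(tasks, cap=200, seed=0):
    """Limit each task to at most `cap` randomly selected samples.
    Tasks with insufficient data are kept whole, mirroring the
    exception the paper describes. `tasks` maps task name -> samples."""
    rng = random.Random(seed)
    capped = {}
    for name, samples in tasks.items():
        if len(samples) > cap:
            capped[name] = rng.sample(samples, cap)
        else:
            capped[name] = list(samples)
    return capped

def build_testmini(test_set, fraction=0.1, seed=0):
    """Draw a subset roughly `fraction` the size of the full test set,
    analogous to MMIU's testmini split (1/10th of test)."""
    rng = random.Random(seed)
    k = max(1, round(len(test_set) * fraction))
    return rng.sample(test_set, k)
```

With a hypothetical 10,400-sample test set, `build_testmini` would yield the 1,040-sample quick-test split the paper reports.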
Hardware Specification: Yes
LLM Response: As shown in Table 23, our benchmark enables efficient evaluation across all tested models, regardless of their size or complexity. The results show that most models can be tested on a single A100 GPU within 1 hour; even the closed-source model GPT-4o can be tested within 1 hour for only 8.57 USD.
Software Dependencies: No
LLM Response: The paper mentions tools such as OpenCompass (Contributors, 2023) and models such as GPT-4o (OpenAI, 2024), but does not specify version numbers for the programming languages, libraries, or frameworks used in its own methodology.
Experiment Setup: No
LLM Response: Section 4.1 "EXPERIMENT SETUP" describes the evaluation method, including how model responses are matched, how long input sequences are handled, the discarding of samples due to copyright restrictions for closed-source models, and the shuffling of options to prevent positional bias. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level settings for the models or the evaluation process.
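The option-shuffling step mentioned in the setup, which reorders multiple-choice options so the correct answer's position carries no signal, can be sketched as below. The function name and signature are hypothetical, not taken from the paper's code.

```python
import random

def shuffle_options(options, answer_idx, seed=None):
    """Shuffle multiple-choice options and return both the shuffled
    list and the new index of the correct answer, so scoring can
    still match the model's letter choice against the right option."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_answer = order.index(answer_idx)
    return shuffled, new_answer
```

Tracking the answer's new index is the key detail: without remapping, any accuracy computed after shuffling would compare against the wrong option.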