Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Tianshuo Yang, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

ICLR 2025

Reproducibility Variables
Research Type: Experimental
LLM Response: To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. Our evaluation of nearly 30 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We conduct ablation studies to investigate the impact of unanswerable questions and multi-image concatenation methods on model performance.
Researcher Affiliation: Collaboration
LLM Response: 1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 The University of Hong Kong, 4 SenseTime Research, 5 Tsinghua University
Pseudocode: No
LLM Response: The paper describes methods and processes in text but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes
LLM Response: We release the data and code at https://github.com/OpenGVLab/MMIU. First, we introduce and open-source the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite that addresses various complex multi-image tasks, thereby filling a critical gap in multi-image comprehension.
Open Datasets: Yes
LLM Response: To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. We release the data and code at https://github.com/OpenGVLab/MMIU. Table 5: task descriptions and corresponding datasets for multi-image tasks in temporal relationships, including Kinetics (Kay et al., 2017), MSVD (Chen & Dolan, 2011), and MSRVTT (Xu et al., 2016). Table 6: task descriptions and corresponding datasets for multi-image tasks in spatial relationships, including MSCOCO (Lin et al., 2014) and ScanNet (Dai et al., 2017). Table 7: task descriptions and corresponding datasets for multi-image tasks in semantic relationships, including Places365 (Zhou et al., 2017) and LFW (Huang et al., 2008).
Dataset Splits: Yes
LLM Response: We provide two versions of MMIU: test and testmini, the latter being one-tenth the size of the former (1,040 samples) for quick testing. In this paper, we primarily conduct experiments and analyses on the test set, while results on testmini are recorded in Table 15 in the appendix for comparison. For efficient evaluation, we limit each task to a maximum of 200 randomly selected samples, except for tasks with insufficient data.
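The sampling protocol described above (a 200-sample cap per task, plus a one-tenth-size testmini subset for quick testing) can be sketched as follows. This is a minimal illustration, not code from the MMIU repository; the function names, seeding scheme, and data layout are assumptions.

```python
import random

def cap_task_samples(tasks, cap=200, seed=0):
    """Limit each task to at most `cap` randomly selected samples.
    Tasks with insufficient data are kept whole, mirroring the
    exception the paper describes. `tasks` maps task name -> samples."""
    rng = random.Random(seed)
    capped = {}
    for name, samples in tasks.items():
        if len(samples) > cap:
            capped[name] = rng.sample(samples, cap)
        else:
            capped[name] = list(samples)
    return capped

def build_testmini(test_set, fraction=0.1, seed=0):
    """Draw a subset roughly `fraction` the size of the full test set,
    analogous to MMIU's testmini split (1/10th of test)."""
    rng = random.Random(seed)
    k = max(1, round(len(test_set) * fraction))
    return rng.sample(test_set, k)
```

With a hypothetical 10,400-sample test set, `build_testmini` would yield the 1,040-sample quick-test split the paper reports.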
Hardware Specification: Yes
LLM Response: As shown in Table 23, our benchmark enables efficient evaluation across all tested models, regardless of their size or complexity. The results show that most models can be tested on a single A100 GPU within 1 hour; even the closed-source model GPT-4o can be tested within 1 hour for only 8.57 USD.
Software Dependencies: No
LLM Response: The paper mentions tools such as OpenCompass (Contributors, 2023) and models such as GPT-4o (OpenAI, 2024), but does not specify version numbers for the programming languages, libraries, or frameworks used in its own methodology.
Experiment Setup: No
LLM Response: Section 4.1 "EXPERIMENT SETUP" describes the evaluation method, including how model responses are matched, how long input sequences are handled, the discarding of samples due to copyright restrictions for closed-source models, and the shuffling of options to prevent positional bias. However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level settings for the models or the evaluation process.
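The option-shuffling step mentioned in the setup, which reorders multiple-choice options so the correct answer's position carries no signal, can be sketched as below. The function name and signature are hypothetical, not taken from the paper's code.

```python
import random

def shuffle_options(options, answer_idx, seed=None):
    """Shuffle multiple-choice options and return both the shuffled
    list and the new index of the correct answer, so scoring can
    still match the model's letter choice against the right option."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_answer = order.index(answer_idx)
    return shuffled, new_answer
```

Tracking the answer's new index is the key detail: without remapping, any accuracy computed after shuffling would compare against the wrong option.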