Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Ziyan Jiang, Wang Zhu, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present MEGA-BENCH, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks... In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators... We evaluate a wide variety of frontier vision-language models on MEGA-BENCH to understand their capabilities across these dimensions. Section 4: EXPERIMENTS: We evaluate 19 VLMs with multi-image support on MEGA-BENCH. 4.1 describes the evaluated models and the evaluation pipeline. 4.2 presents the evaluation results with a fine-grained analytical breakdown. 4.3 provides analyses on the number of examples per task and error types. |
| Researcher Affiliation | Academia | Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen. MEGA-Bench Team. * Core Contributors, Contributed equally. See the Author Contribution Statement for details. EMAIL; EMAIL |
| Pseudocode | No | The paper includes figures illustrating annotation formats (Figure 8) and prompt templates (Figures 11, 12) which present structured information. However, none of these are explicitly labeled as "Pseudocode" or "Algorithm", nor do they represent a proposed algorithm's structured steps. |
| Open Source Code | No | We created a private GitHub repository for constructing MEGA-BENCH. The repository’s main branch is protected, and all task submissions must go through pull requests (PRs). In our project page, we will provide a similar visualization page for users to interactively inspect the behaviors of different VLMs. |
| Open Datasets | Yes | We present MEGA-BENCH, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks... In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. In Table 18, we list data source details for every task in our benchmark. |
| Dataset Splits | Yes | The Core Set is evaluated with rule-based metrics to make the evaluation fast and cost-free. The Open-Ended Set is evaluated with metrics that use an LLM-as-a-judge... The Core and Open-Ended sets contain 440 and 65 tasks, respectively. |
| Hardware Specification | No | We also express our gratitude to Tung Vu from Green Node for providing access to GPUs, which were instrumental in running some of the evaluation experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o-0806 as an LLM judge, and discusses using a GitHub repository and an annotation GUI tool. However, it does not specify version numbers for any software libraries, frameworks, or programming languages used in their methodology. |
| Experiment Setup | Yes | For each query, we fill in a pre-defined prompt template with the task instructions written by the task annotators, the 1-shot example, and the concrete query question. Since this one-shot example’s primary purpose is to illustrate the output format, we allocate it a tiny portion of the total image budget. For each model, we conduct experiments with and without Chain-of-Thought (CoT) prompting (Wei et al., 2022) for the Core tasks. Full evaluation details are in D. Table 6: The maximum number of images and the budget for the in-context example per model. For images or video frames with a longer side larger than 1000 pixels, we resize the longer side to 1000 without changing the aspect ratio before sending them to the evaluated model. |
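The image-preprocessing rule quoted in the Experiment Setup row (cap the longer side at 1000 pixels while preserving aspect ratio) can be sketched as a small helper. This is a minimal illustration, not code from the MEGA-Bench pipeline; the function name `resize_dims` and the rounding behavior are assumptions.

```python
def resize_dims(width: int, height: int, max_long_side: int = 1000) -> tuple[int, int]:
    """Compute new (width, height) so the longer side is at most
    max_long_side, preserving the aspect ratio (rounded to integers).

    Illustrative sketch of the resizing rule described in the paper;
    the rounding choice here is an assumption, not the authors' code.
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height  # already within budget, leave untouched
    scale = max_long_side / long_side
    return round(width * scale), round(height * scale)


# Example: a 2000x1200 frame is scaled so its longer side becomes 1000.
print(resize_dims(2000, 1200))  # (1000, 600)
```

The resulting dimensions can then be passed to any image library's resize call before sending frames to the evaluated model.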