Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Ziyan Jiang, Wang Zhu, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present MEGA-BENCH, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks... In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators... We evaluate a wide variety of frontier vision-language models on MEGA-BENCH to understand their capabilities across these dimensions. Section 4: EXPERIMENTS: We evaluate 19 VLMs with multi-image support on MEGA-BENCH. 4.1 describes the evaluated models and the evaluation pipeline. 4.2 presents the evaluation results with a fine-grained analytical breakdown. 4.3 provides analyses on the number of examples per task and error types. |
| Researcher Affiliation | Academia | Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen. MEGA-Bench Team. * Core Contributors, Contributed equally. See the Author Contribution Statement for details. EMAIL; EMAIL |
| Pseudocode | No | The paper includes figures illustrating annotation formats (Figure 8) and prompt templates (Figures 11, 12) which present structured information. However, none of these are explicitly labeled as "Pseudocode" or "Algorithm", nor do they represent a proposed algorithm's structured steps. |
| Open Source Code | No | We created a private GitHub repository for constructing MEGA-BENCH. The repository’s main branch is protected, and all task submissions must go through pull requests (PRs). In our project page, we will provide a similar visualization page for users to interactively inspect the behaviors of different VLMs. |
| Open Datasets | Yes | We present MEGA-BENCH, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks... In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. In Table 18, we list data source details for every task in our benchmark. |
| Dataset Splits | Yes | The Core Set is evaluated with rule-based metrics to make the evaluation fast and cost-free. The Open-Ended Set is evaluated with metrics that use an LLM-as-a-judge... The Core and Open-Ended sets contain 440 and 65 tasks, respectively. |
| Hardware Specification | No | We also express our gratitude to Tung Vu from Green Node for providing access to GPUs, which were instrumental in running some of the evaluation experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o-0806 as an LLM judge, and discusses using a GitHub repository and an annotation GUI tool. However, it does not specify version numbers for any software libraries, frameworks, or programming languages used in their methodology. |
| Experiment Setup | Yes | For each query, we fill in a pre-defined prompt template with the task instructions written by the task annotators, the 1-shot example, and the concrete query question. Since this one-shot example’s primary purpose is to illustrate the output format, we allocate it a tiny portion of the total image budget. For each model, we conduct experiments with and without Chain-of-Thought (CoT) prompting (Wei et al., 2022) for the Core tasks. Full evaluation details are in D. Table 6: The maximum number of images and the budget for the in-context example per model. For images or video frames with a longer side larger than 1000 pixels, we resize the longer side to 1000 without changing the aspect ratio before sending them to the evaluated model. |
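The image-preprocessing rule quoted in the Experiment Setup row (cap the longer side at 1000 pixels while preserving aspect ratio) can be sketched as a small helper. This is a minimal illustration, not code from the MEGA-Bench pipeline; the function name `resize_dims` and the rounding behavior are assumptions.

```python
def resize_dims(width: int, height: int, max_long_side: int = 1000) -> tuple[int, int]:
    """Compute new (width, height) so the longer side is at most
    max_long_side, preserving the aspect ratio (rounded to integers).

    Illustrative sketch of the resizing rule described in the paper;
    the rounding choice here is an assumption, not the authors' code.
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height  # already within budget, leave untouched
    scale = max_long_side / long_side
    return round(width * scale), round(height * scale)


# Example: a 2000x1200 frame is scaled so its longer side becomes 1000.
print(resize_dims(2000, 1200))  # (1000, 600)
```

The resulting dimensions can then be passed to any image library's resize call before sending frames to the evaluated model.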