Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MR. Video: MapReduce as an Effective Principle for Long Video Understanding

Authors: Ziqi Pang, Yu-Xiong Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our MR. Video achieves a >7% accuracy improvement on the challenging LVBench over state-of-the-art video agents and vision-language models (VLMs) and demonstrates a clear advantage on multiple long video benchmarks, highlighting the potential of the MapReduce principle. The code is at https://github.com/ziqipang/MR-Video.
Researcher Affiliation	Academia	Ziqi Pang Yu-Xiong Wang University of Illinois Urbana-Champaign EMAIL
Pseudocode	No	The paper describes the MR. Video framework and its components (Captioning stage, Analysis stage) using figures (Fig. 1, Fig. 2, Fig. 3, Fig. 4, Fig. 5) and textual descriptions of steps, but it does not present any formal pseudocode blocks or algorithms.
Open Source Code	Yes	The code is at https://github.com/ziqipang/MR-Video.
Open Datasets	Yes	To validate the MapReduce principle within our limited budget, we focus on the challenging long video benchmark: LVBench [39]. Compared with others [9, 31, 35, 63], LVBench features more extremely long video durations and challenging questions... We expand the breadth of evaluation using the subsets of other representative video understanding benchmarks, especially the long video parts of LongVideoBench [47], Video-MME [9], and EgoSchema [31].
Dataset Splits	Yes	For the ablation study (Sec. 4.4), we form a subset to save the budget by selecting the first video of each video category in LVBench. This subset has 6 videos and 98 questions in total. For additional evaluation, we use (1) the longest subset of LongVideoBench's validation set, (2) the long video subset of Video-MME without subtitles, and (3) the validation set of EgoSchema.
Hardware Specification	Yes	This work used computational resources, including the NCSA Delta and DeltaAI supercomputers through allocations CIS230012, CIS240133, and CIS240387 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, as well as the TACC Frontera supercomputer, Amazon Web Services (AWS), and OpenAI API through the National Artificial Intelligence Research Resource (NAIRR) Pilot.
Software Dependencies	No	We utilize Gemini-2.0-Flash [37] as our VLM, and we only use GPT4o to process texts. While specific models are named, the paper does not provide version numbers for ancillary software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation beyond the APIs.
Experiment Setup	Yes	To save our expenses, we utilize Gemini-2.0-Flash [37] as our VLM, and we only use GPT4o to process texts. On average, generating the dense captions for an hour-long video requires approximately $0.8 of Gemini-2.0-Flash, and answering each question from LVBench costs $0.4 GPT4o on average. We provide further details, especially the prompts, in Sec. D. Controlled Context Lengths. We highlight a vital implementation detail so that our video agent is meaningful for overcoming the context length challenges: we explicitly control the VLM to perceive less than 40 frames per query, significantly less than the typical 256 or even more frames for long video VLMs [22]. This ensures MR. Video does not violate the motivation of building video agents.