Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Authors: Romy Luo, Zihui (Sherry) Xue, Alex Dimakis, Kristen Grauman

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence for large multimodal models that not only "think before answering", but also "see while thinking". We evaluate our Video-VER model across 10 diverse video understanding benchmarks. Compared to strong base models and existing reasoning techniques, Video-VER consistently ranks first or second. Furthermore, our model achieves consistently strong margins compared to its respective base MLLM (trained without the Visual Evidence Reward) as much as +9.0% absolute accuracy gains, and an average of +4.0% across all 10 benchmarks.
Researcher Affiliation	Collaboration	1The University of Texas at Austin 2UC Berkeley 3Bespoke Labs
Pseudocode	No	The paper describes methods and mathematical formulations, but it does not include any explicit section or figure labeled "Pseudocode" or "Algorithm".
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code and data will be released.
Open Datasets	Yes	Benchmarks. We extensively evaluate our model across a broad spectrum of 10 public video understanding benchmarks, covering a wide range of reasoning skills. These include comprehensive, all-around benchmarks such as MVBench [32] and Video-MME [18]; temporal reasoning benchmarks like TVBench [13], Vinoground [71], and TempCompass [38]; spatial reasoning benchmarks such as VSI-Bench [63]; and knowledge-intensive datasets including Video-MMMU [24] and MMVU [76]. We also assess robustness to hallucination using dedicated benchmarks such as EventHallusion [70] and VideoHallucer [57].
Dataset Splits	No	The paper mentions using specific datasets for training and evaluating on benchmarks. For instance: "The process begins with SFT, where we train the model on Video-R1-COT-165k dataset [17]... This phase uses a dataset mixture comprising Reversed-in-Time [14] and Video-R1-260k [17] samples." However, it does not explicitly state the training, validation, or test splits for these datasets, nor does it explicitly cite predefined splits for the benchmarks in a way that fully meets the criteria for reproduction.
Hardware Specification	Yes	We train our model with 8 NVIDIA H200 GPUs.
Software Dependencies	Yes	Our model is a post-trained Qwen2.5-VL-7B [4]... We utilize Llama-3.1-70B-Instruct [19] as our LLM-based judge.
Experiment Setup	Yes	During training, the maximum number of video frames is set as 16, and increased to 32 at inference time for both our model and all baselines, unless otherwise specified. For GRPO training, we incorporate four reward components: an accuracy reward, our visual evidence reward (with weight α = 0.3), a format reward to encourage consistent answer structure, and a length reward to promote moderately long, informative responses. We train our model with 8 NVIDIA H200 GPUs. GRPO group size G is set as 8. The number of RL iterations is set to 2,000.
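
The settings quoted in the Experiment Setup row can be gathered into a small configuration sketch. The numeric values are the ones stated above. Everything else is an assumption made for illustration and is not taken from the paper: the names `SetupConfig`, `total_reward`, and `group_advantages`, and in particular the plain sum used to combine the four rewards, since the row only says that the visual evidence reward has weight α = 0.3. The advantage function shows the group-relative normalization commonly used in GRPO; the authors' exact variant is not specified here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class SetupConfig:
    # Values quoted in the Experiment Setup row.
    max_frames_train: int = 16
    max_frames_infer: int = 32
    evidence_weight: float = 0.3  # alpha, weight of the visual evidence reward
    group_size: int = 8           # GRPO group size G
    rl_iterations: int = 2000
    num_gpus: int = 8             # NVIDIA H200


def total_reward(accuracy, evidence, fmt, length, cfg):
    # ASSUMPTION: an unweighted sum of accuracy, format, and length rewards,
    # plus the evidence reward scaled by alpha. The source does not give
    # the exact combination or the weights of the other terms.
    return accuracy + cfg.evidence_weight * evidence + fmt + length


def group_advantages(rewards, eps=1e-6):
    # Group-relative normalization: each sampled response is scored
    # against the mean and standard deviation of its own group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With these assumptions, a group of G = 8 sampled responses would each receive a `total_reward`, and `group_advantages` would turn those eight scalars into zero-mean advantages for the policy update.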