Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Authors: Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Boyd-Graber

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate several leading VLMs, including Qwen-2.5-VL, Video-R1, and VideoChat-R1. Despite their strong performance on real-world benchmarks (e.g., MVBench, MMVU), these models hallucinate or fail to detect physical or logical violations, revealing fundamental weaknesses in visual understanding. Finally, we explore reinforcement learning based post-training on our negative dataset: fine-tuning improves performance on VideoHallu without degrading results on standard benchmarks, indicating enhanced visual reasoning in VLMs.
Researcher Affiliation	Academia	University of Maryland, College Park; University of Southern California
Pseudocode	No	The paper describes methods like SFT and GRPO using mathematical formulations (Equation 1 and the GRPO objective function) and textual descriptions, but does not present these as structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our data is available at https://github.com/zli12321/VideoHallu.git. We will release our code and the dataset we use, with sufficient instructions.
Open Datasets	Yes	We introduce VideoHallu, a synthetic video dataset featuring physics- and commonsense-violating scenes generated using state-of-the-art tools such as Veo2, Sora, and Kling. The dataset includes expert-annotated question-answer pairs spanning four categories of physical and commonsense violations, designed to be straightforward for human reasoning. Our data is available at https://github.com/zli12321/VideoHallu.git.
Dataset Splits	Yes	Our dataset comprises 3,233 video question-answer pairs with no video overlap across splits: 800 pairs for training, 908 for validation, and 1,525 for testing.
Hardware Specification	Yes	To keep a fair comparison across different finetuning methods while reducing the training resources needed, we use 15 frames during training with learning rate 1e-6 to train the model for one epoch using the Open-R1 [38] framework on eight A100 80G GPUs.
Software Dependencies	No	The paper mentions using the 'Open-R1 [38] framework' but does not provide specific version numbers for this or any other software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup	Yes	To keep a fair comparison across different finetuning methods while reducing the training resources needed, we use 15 frames during training with learning rate 1e-6 to train the model for one epoch using the Open-R1 [38] framework on eight A100 80G GPUs.
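
The split sizes and training settings quoted in the table can be written down as a small, checkable record. The sketch below is illustrative only: the names `SPLITS`, `TrainingSetup`, `split_fractions`, and `check_splits` are hypothetical and do not come from the paper's code. The learning rate is read as 1e-6, which assumes the exponent sign was lost in extraction. The Open-R1 version is left unspecified because the paper does not report it.

```python
from dataclasses import dataclass

# Split sizes as quoted in the Dataset Splits row.
SPLITS = {"train": 800, "validation": 908, "test": 1525}
REPORTED_TOTAL = 3233  # total question-answer pairs quoted in the same row


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the quoted setup; field names are illustrative."""
    frames_per_video: int = 15
    learning_rate: float = 1e-6  # assumption: the source text reads "1e 6"
    epochs: int = 1
    num_gpus: int = 8
    gpu_model: str = "A100 80G"
    framework: str = "Open-R1"  # version not reported in the paper


def check_splits(splits, reported_total):
    """Return True when the split sizes add up to the reported total."""
    return sum(splits.values()) == reported_total


def split_fractions(splits):
    """Return each split's share of the whole dataset."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    print("splits consistent:", check_splits(SPLITS, REPORTED_TOTAL))
    for name, fraction in split_fractions(SPLITS).items():
        print(f"{name}: {fraction:.1%}")
    print(TrainingSetup())
```

A check like this confirms that the three quoted split sizes sum to the quoted total. It says nothing about the missing software versions flagged in the Software Dependencies row.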