Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Authors: Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Boyd-Graber

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate several leading VLMs, including Qwen-2.5-VL, Video-R1, and VideoChat-R1. Despite their strong performance on real-world benchmarks (e.g., MVBench, MMVU), these models hallucinate or fail to detect physical or logical violations, revealing fundamental weaknesses in visual understanding. Finally, we explore reinforcement learning based post-training on our negative dataset: fine-tuning improves performance on VideoHallu without degrading results on standard benchmarks, indicating enhanced visual reasoning in VLMs.
Researcher Affiliation | Academia | University of Maryland, College Park; University of Southern California; EMAIL, EMAIL
Pseudocode | No | The paper describes methods like SFT and GRPO using mathematical formulations (Equation 1 and the GRPO objective function) and textual descriptions, but does not present these as structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our data is available at https://github.com/zli12321/VideoHallu.git. We will release our code and the dataset we use, with sufficient instructions.
Open Datasets | Yes | We introduce VideoHallu, a synthetic video dataset featuring physics- and commonsense-violating scenes generated using state-of-the-art tools such as Veo2, Sora, and Kling. The dataset includes expert-annotated question-answer pairs spanning four categories of physical and commonsense violations, designed to be straightforward for human reasoning. Our data is available at https://github.com/zli12321/VideoHallu.git.
Dataset Splits | Yes | Our dataset comprises 3,233 video question-answer pairs with no video overlap across splits: 800 pairs for training, 908 for validation, and 1,525 for testing.
Hardware Specification | Yes | To keep a fair comparison across different finetuning methods while reducing the training resources needed, we use 15 frames during training with learning rate 1e-6 to train the model for one epoch using the Open-R1 [38] framework on eight A100 80G GPUs.
Software Dependencies | No | The paper mentions using the 'Open-R1 [38] framework' but does not provide specific version numbers for this or any other software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | To keep a fair comparison across different finetuning methods while reducing the training resources needed, we use 15 frames during training with learning rate 1e-6 to train the model for one epoch using the Open-R1 [38] framework on eight A100 80G GPUs.
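
The Dataset Splits entry reports two properties that are straightforward to check: the split counts should sum to the stated total, and no video should appear in more than one split. The following is a minimal Python sketch of both checks, not the authors' code. The function names and the example video IDs are hypothetical, since the source does not describe the dataset's file layout or identifiers.

```python
from itertools import combinations

# Split sizes and total as reported in the Dataset Splits entry.
REPORTED_SPLITS = {"train": 800, "validation": 908, "test": 1525}
REPORTED_TOTAL = 3233


def check_split_sizes(splits, total):
    """Return True if the per-split counts add up to the reported total."""
    return sum(splits.values()) == total


def overlapping_videos(video_ids_by_split):
    """Return video IDs shared between any two splits, keyed by split pair."""
    overlaps = {}
    for a, b in combinations(sorted(video_ids_by_split), 2):
        shared = set(video_ids_by_split[a]) & set(video_ids_by_split[b])
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical IDs for illustration only; real identifiers are not given in the source.
example_ids = {
    "train": ["vid_001", "vid_002"],
    "validation": ["vid_003"],
    "test": ["vid_004", "vid_002"],  # vid_002 deliberately leaks into test
}

if __name__ == "__main__":
    print(check_split_sizes(REPORTED_SPLITS, REPORTED_TOTAL))
    print(overlapping_videos(example_ids))
```

An empty dictionary from `overlapping_videos` on the real split files would be consistent with the "no video overlap across splits" claim.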