Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MHBench: Demystifying Motion Hallucination in VideoLLMs
Authors: Ming Kong, Xianzhou Zeng, Luyuan Chen, Yadong Li, Bo Yan, Qiang Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MHBench reveal that current state-of-the-art Video LLMs significantly suffer from motion hallucination, while the introduction of Motion CD can effectively mitigate this issue, achieving up to a 15.1% performance improvement. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, 2Beijing Information Science and Technology University, 3Ant Group EMAIL; EMAIL; EMAIL |
| Pseudocode | No | The paper describes the Motion Contrastive Decoding (Motion CD) method and its Bidirectional Motion Elimination (BME) strategy in paragraph text and with a figure, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Benchmark, Code and Appendix https://github.com/xzhouzeng/MHBench |
| Open Datasets | Yes | To systematically evaluate motion hallucination in Video LLMs, we constructed a benchmark dataset called MHBench, consisting of 1,200 videos and 20 action categories... Benchmark, Code and Appendix https://github.com/xzhouzeng/MHBench. We first collected 100 videos containing single action contents from the validation set of something2something-v2 (Goyal et al. 2017). |
| Dataset Splits | No | The paper describes the construction of adversarial triplet types of videos (original/antonym/incomplete) for the MHBench dataset, which functions as an evaluation benchmark. It details the composition of the benchmark (1,200 high-quality videos, with each action category including 20 samples per defined action type), but does not specify explicit training/validation/test splits for models to be trained or evaluated on MHBench itself beyond it being an evaluation set. |
| Hardware Specification | No | The paper provides no hardware details (e.g., GPU or CPU models, memory) for running the experiments or applying the Motion CD method. |
| Software Dependencies | No | The paper mentions LLM backbones like 'Mistral-7B (Jiang et al. 2023)' and 'Vicuna-7B v0 (Chiang et al. 2023)' but does not provide specific software library names with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) that are needed to reproduce the experiment. |
| Experiment Setup | Yes | We select the optimal hyperparameter settings through a grid search on the Video Chat2-Mistral model and extend it to other models. Specifically, the noise intensity α=20 and the adaptive rationality constraint β=0.1. |
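The hyperparameters above (noise intensity α=20, adaptive rationality constraint β=0.1) suggest a VCD-style contrastive decoding step. As a rough illustration only, the sketch below implements the *generic* contrastive-decoding formulation with an adaptive plausibility constraint; the paper's exact Motion CD update, the role of α in producing the motion-eliminated view, and the parameter name `gamma` here are assumptions, not taken from the source.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_distorted, gamma=1.0, beta=0.1):
    """One generic contrastive-decoding step (sketch, not the paper's exact method).

    logits_orig      -- next-token logits conditioned on the original video
    logits_distorted -- logits conditioned on a motion-eliminated view of the video
    gamma            -- contrastive strength (hypothetical name, not from the paper)
    beta             -- adaptive plausibility threshold (the paper reports beta = 0.1)
    """
    # Amplify what the original video supports relative to the distorted view.
    contrast = (1.0 + gamma) * logits_orig - gamma * logits_distorted

    # Adaptive plausibility constraint: keep only tokens whose probability under
    # the original view is within a beta-fraction of the most likely token.
    probs_orig = np.exp(logits_orig - logits_orig.max())
    probs_orig /= probs_orig.sum()
    mask = probs_orig >= beta * probs_orig.max()
    contrast = np.where(mask, contrast, -np.inf)

    # Renormalise over the surviving tokens (masked tokens get probability 0).
    out = np.exp(contrast - contrast[mask].max())
    out /= out.sum()
    return out
```

With β=0.1, any token whose original-view probability falls below one tenth of the top token's probability is excluded before the contrast is applied, which prevents the subtraction from promoting implausible tokens.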