Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MHBench: Demystifying Motion Hallucination in VideoLLMs
Authors: Ming Kong, Xianzhou Zeng, Luyuan Chen, Yadong Li, Bo Yan, Qiang Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on MHBench reveal that current state-of-the-art Video LLMs significantly suffer from motion hallucination, while the introduction of Motion CD can effectively mitigate this issue, achieving up to a 15.1% performance improvement. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, 2Beijing Information Science and Technology University, 3Ant Group EMAIL; EMAIL; EMAIL |
| Pseudocode | No | The paper describes the Motion Contrastive Decoding (Motion CD) method and its Bidirectional Motion Elimination (BME) strategy in paragraph text and with a figure, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Benchmark, Code and Appendix https://github.com/xzhouzeng/MHBench |
| Open Datasets | Yes | To systematically evaluate motion hallucination in Video LLMs, we constructed a benchmark dataset called MHBench, consisting of 1,200 videos and 20 action categories... Benchmark, Code and Appendix https://github.com/xzhouzeng/MHBench. We first collected 100 videos containing single action contents from the validation set of something2something-v2 (Goyal et al. 2017). |
| Dataset Splits | No | The paper describes the construction of adversarial triplet types of videos (original/antonym/incomplete) for the MHBench dataset, which functions as an evaluation benchmark. It details the composition of the benchmark (1,200 high-quality videos, with each action category including 20 samples per defined action type), but does not specify explicit training/validation/test splits for models to be trained or evaluated on MHBench itself beyond it being an evaluation set. |
| Hardware Specification | No | The paper provides no hardware details (e.g., GPU or CPU models, memory) for running the experiments or applying the Motion CD method. |
| Software Dependencies | No | The paper mentions LLM backbones like 'Mistral-7B (Jiang et al. 2023)' and 'Vicuna-7B v0 (Chiang et al. 2023)' but does not provide specific software library names with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) that are needed to reproduce the experiment. |
| Experiment Setup | Yes | We select the optimal hyperparameter settings through a grid search on the Video Chat2-Mistral model and extend it to other models. Specifically, the noise intensity α=20 and the adaptive rationality constraint β=0.1. |
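The hyperparameters above (noise intensity α=20, adaptive rationality constraint β=0.1) suggest a VCD-style contrastive decoding step. As a rough illustration only, the sketch below implements the *generic* contrastive-decoding formulation with an adaptive plausibility constraint; the paper's exact Motion CD update, the role of α in producing the motion-eliminated view, and the parameter name `gamma` here are assumptions, not taken from the source.

```python
import numpy as np

def contrastive_decode_step(logits_orig, logits_distorted, gamma=1.0, beta=0.1):
    """One generic contrastive-decoding step (sketch, not the paper's exact method).

    logits_orig      -- next-token logits conditioned on the original video
    logits_distorted -- logits conditioned on a motion-eliminated view of the video
    gamma            -- contrastive strength (hypothetical name, not from the paper)
    beta             -- adaptive plausibility threshold (the paper reports beta = 0.1)
    """
    # Amplify what the original video supports relative to the distorted view.
    contrast = (1.0 + gamma) * logits_orig - gamma * logits_distorted

    # Adaptive plausibility constraint: keep only tokens whose probability under
    # the original view is within a beta-fraction of the most likely token.
    probs_orig = np.exp(logits_orig - logits_orig.max())
    probs_orig /= probs_orig.sum()
    mask = probs_orig >= beta * probs_orig.max()
    contrast = np.where(mask, contrast, -np.inf)

    # Renormalise over the surviving tokens (masked tokens get probability 0).
    out = np.exp(contrast - contrast[mask].max())
    out /= out.sum()
    return out
```

With β=0.1, any token whose original-view probability falls below one tenth of the top token's probability is excluded before the contrast is applied, which prevents the subtraction from promoting implausible tokens.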