Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering

Authors: JIANFENG CAI, Jiale Hong, Zongmeng Zhang, Wengang Zhou, zhannianji, Houqiang Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple models and benchmarks demonstrate that our method markedly reduces hallucination in VideoLLMs, thereby validating the robustness of our findings. Our experiments demonstrate the effectiveness of our method, yielding consistent gains across multiple benchmarks (e.g., up to +5.52% on VidHalluc [41] and up to +24.21% on EventHallusion [82]) and across diverse models, including Video-LLaVA [47], VideoLLaMA2 [16], and Qwen2.5-VL [6].
Researcher Affiliation | Collaboration | University of Science and Technology of China; Shanghai Jiaotong University; Merchants Union Consumer Finance Company Limited; EMAIL
Pseudocode | No | The paper describes methods and processes using textual descriptions and diagrams (Figure 4, Figure 5) but does not include structured pseudocode or algorithm blocks with explicit labels like "Algorithm" or "Pseudocode".
Open Source Code | Yes | Code and dataset are available at https://github.com/cai-jianfeng/TA-AE and https://huggingface.co/datasets/caijanfeng/TA-AE
Open Datasets | Yes | Code and dataset are available at https://github.com/cai-jianfeng/TA-AE and https://huggingface.co/datasets/caijanfeng/TA-AE. Specifically, we refrain from reusing the benchmark sources employed in Section 4.2 and instead adopt the ShareGPT4Video dataset [13] as our base collection... We assess the effectiveness of our method on two representative benchmarks. (a) VidHalluc [41]... (b) EventHallusion [82]...
Dataset Splits | Yes | To determine the temporal variation characteristics of a given (m, q) pair, we train a temporal variation classifier θ on Df a and Df t . We randomly sample 400 instances from each dataset as the training set and use the remaining samples for validation. For each attention head j, we train a binary classifier using the training set to determine whether the input vector v originates from a hallucination-inducing prompt. Finally, we evaluate each classifier on the corresponding validation set.
Hardware Specification | Yes | All experiments are conducted on a single machine with 8 NVIDIA A100 80 GB GPUs. To ensure fairness, each model was tested on a single 3090 GPU, eliminating any additional overhead from distributed inference.
Software Dependencies | No | The paper does not explicitly state the version numbers for key software dependencies such as specific machine learning frameworks (e.g., PyTorch, TensorFlow) or other libraries used in the implementation.
Experiment Setup | Yes | The main hyperparameters include K, the number of top-ranked attention heads selected, and α, the scaling factor applied to the offset vectors. The search space is defined as the Cartesian product: {32, 64, 128, 256} × {8, 16, 24, 32}... For TCD, we tune the frame downsampling rate r, as well as the contrastive decoding parameters α and β, over the space: {2, 4, 8} × {0.25, 0.5, 0.75, 1.0} × {0.1, 0.5}... The learning rate is set to 1 × 10⁻⁵, followed by a cosine learning rate schedule with an initial warmup of 10 steps and a batch size of 8. The classifier is trained for 5 epochs.
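
The Experiment Setup entry quotes two hyperparameter search spaces and a learning-rate schedule. The following is a minimal Python sketch of how those quoted values could be enumerated and applied; it is not taken from the paper's repository. The function names, the linear shape of the warmup, the decay to zero, and the way total steps are derived from epochs and batch size are all assumptions made for illustration.

```python
import math
from itertools import product

# Values quoted in the Experiment Setup entry.
K_VALUES = [32, 64, 128, 256]          # number of top-ranked attention heads
ALPHA_VALUES = [8, 16, 24, 32]         # scaling factor for the offset vectors
TCD_R_VALUES = [2, 4, 8]               # frame downsampling rate r
TCD_ALPHA_VALUES = [0.25, 0.5, 0.75, 1.0]
TCD_BETA_VALUES = [0.1, 0.5]
BASE_LR = 1e-5
WARMUP_STEPS = 10
BATCH_SIZE = 8
EPOCHS = 5


def main_grid():
    """All (K, alpha) pairs from the Cartesian product of the two sets."""
    return [{"K": k, "alpha": a} for k, a in product(K_VALUES, ALPHA_VALUES)]


def tcd_grid():
    """All (r, alpha, beta) triples for the TCD search space."""
    return [
        {"r": r, "alpha": a, "beta": b}
        for r, a, b in product(TCD_R_VALUES, TCD_ALPHA_VALUES, TCD_BETA_VALUES)
    ]


def total_steps(num_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, assuming one step per batch (hypothetical helper)."""
    return epochs * math.ceil(num_examples / batch_size)


def lr_at(step, n_steps, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed step.

    Assumes a linear warmup to base_lr, then cosine decay to zero;
    the source states only "cosine schedule with an initial warmup".
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, n_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

Under these assumptions the main search space contains 16 configurations and the TCD space contains 24. Calling `lr_at(step, total_steps(n))` for a training set of `n` examples gives the rate at each step, reaching the quoted peak of 1 × 10⁻⁵ at the end of the 10-step warmup.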