Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

Authors: Chaeyoung Jung, Youngjoon Jang, Joon Son Chung

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. Especially, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD.
Researcher Affiliation	Academia	Chaeyoung Jung Youngjoon Jang Joon Son Chung Korea Advanced Institute of Science and Technology (KAIST)
Pseudocode	Yes	A full description of the AVCD algorithm is provided in Supp. E. Algorithm 1 shows the overall AVCD algorithm.
Open Source Code	Yes	Our code is available at https://github.com/kaistmm/AVCD.
Open Datasets	Yes	Specifically, we evaluate AVCD on the AVHBench dataset [51], a benchmark specifically designed to assess hallucinations in audio-visual settings. When applied to two widely used AV-LLMs, AVCD achieves a 2% relative improvement in accuracy on VideoLLaMA2 [11] and 7% on video-SALMONN [50]. We evaluate AV-LLMs on MUSIC-AVQA [29], which tests synchronized audio-visual reasoning with question-answering (QA) pairs derived from the MUSIC dataset. For video-LLMs, we use MSVD-QA [62], which involves questions about objects, actions, and events in short video clips, using the first 1,000 test examples. We also use ActivityNet-QA [67], a more challenging benchmark requiring reasoning over long videos with complex temporal understanding. On the AVHBench [51] dataset, AVCD produces a smaller deviation from the original logits compared to VCD [28].
Dataset Splits	Yes	We treat the entire dataset as the test set and use the initially released portion as the validation set, which is employed to evaluate inference speed as reported in Figure 5. Since audio-video captioning in AVHBench follows a distinct evaluation protocol, we assess it separately. For video-LLMs, we use MSVD-QA [62], which involves questions about objects, actions, and events in short video clips, using the first 1,000 test examples.
Hardware Specification	Yes	We run all experiments on a machine equipped with an AMD EPYC 7513 32-core CPU and a single NVIDIA RTX A6000 GPU.
Software Dependencies	No	The paper does not explicitly list software dependencies with specific version numbers.
Experiment Setup	Yes	In our experiments, the dominance-aware attentive masking method is applied to all transformer layers except the final layer. Based on the attention map, we mask the locations with the top 50% highest values (refer to Table A.3 in the Supp. D for more details). Based on our analysis that the modality dominance between video and audio is relatively balanced (see Figure A.1 in the Supp. D), we set α_v and α_a to be equal. To determine their optimal values, we randomly select 100 samples from each dataset and vary the value from 0.5 to 3.0 in increments of 0.5. As a result, we set it to 2.5 for the AVHBench dataset, and to 0.5 for all other datasets. Furthermore, we set the entropy threshold to τ = 0.6 for entropy-guided adaptive decoding.
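
Illustrative sketch (not taken from the paper or its repository): the Experiment Setup row names three concrete settings: a mask over the top 50% of attention values, a sweep of the contrast weight from 0.5 to 3.0 in steps of 0.5, and an entropy threshold τ = 0.6. The Python below shows one way these pieces could fit together. All function names are hypothetical. The contrast formula is the generic single-branch form (1 + α)·full − α·masked, which may differ from AVCD's multi-modality formulation. The choice to skip the contrastive pass when the entropy is below τ, and to measure entropy in nats without normalization, are also assumptions.

```python
import math
from typing import Callable, List, Sequence


def alpha_grid() -> List[float]:
    """Candidate contrast weights: 0.5 to 3.0 in increments of 0.5."""
    return [0.5 * k for k in range(1, 7)]


def top_fraction_mask(scores: Sequence[float], fraction: float = 0.5) -> List[bool]:
    """Mark the positions holding the highest `fraction` of attention scores."""
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats (assumption: the paper may normalize it)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_contrastive_logits(
    full_logits: Sequence[float],
    masked_logits_fn: Callable[[], Sequence[float]],
    alpha: float,
    tau: float = 0.6,
) -> List[float]:
    """Return contrasted logits, skipping the masked pass for confident steps.

    Assumption: a step counts as confident when the entropy of the
    unmodified distribution is below tau.
    """
    if entropy(softmax(full_logits)) < tau:
        return list(full_logits)  # confident: no extra forward pass
    masked = masked_logits_fn()  # costly pass with dominant tokens masked
    # Generic contrastive form, assumed here; not confirmed as AVCD's formula.
    return [(1.0 + alpha) * f - alpha * m for f, m in zip(full_logits, masked)]
```

Passing the masked forward pass as a callable keeps the skip explicit: when the entropy check succeeds, the second pass is never run, which is where a speed-up from adaptive decoding would come from in this sketch.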