Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
Authors: Chaeyoung Jung, Youngjoon Jang, Joon Son Chung
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. Especially, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD. |
| Researcher Affiliation | Academia | Chaeyoung Jung Youngjoon Jang Joon Son Chung Korea Advanced Institute of Science and Technology (KAIST) |
| Pseudocode | Yes | A full description of the AVCD algorithm is provided in Supp. E. Algorithm 1 shows the overall AVCD algorithm. |
| Open Source Code | Yes | Our code is available at https://github.com/kaistmm/AVCD. |
| Open Datasets | Yes | Specifically, we evaluate AVCD on the AVHBench dataset [51], a benchmark specifically designed to assess hallucinations in audio-visual settings. When applied to two widely used AV-LLMs, AVCD achieves a 2% relative improvement in accuracy on VideoLLaMA2 [11] and 7% on video-SALMONN [50]. We evaluate AV-LLMs on MUSIC-AVQA [29], which tests synchronized audio-visual reasoning with question-answering (QA) pairs derived from the MUSIC dataset. For video-LLMs, we use MSVD-QA [62], which involves questions about objects, actions, and events in short video clips, using the first 1,000 test examples. We also use ActivityNet-QA [67], a more challenging benchmark requiring reasoning over long videos with complex temporal understanding. On the AVHBench [51] dataset, AVCD produces a smaller deviation from the original logits compared to VCD [28]. |
| Dataset Splits | Yes | We treat the entire dataset as the test set and use the initially released portion as the validation set, which is employed to evaluate inference speed as reported in Figure 5. Since audio-video captioning in AVHBench follows a distinct evaluation protocol, we assess it separately. For video-LLMs, we use MSVD-QA [62], which involves questions about objects, actions, and events in short video clips, using the first 1,000 test examples. |
| Hardware Specification | Yes | We run all experiments on a machine equipped with an AMD EPYC 7513 32-core CPU and a single NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers. |
| Experiment Setup | Yes | In our experiments, the dominance-aware attentive masking method is applied to all transformer layers except the final layer. Based on the attention map, we mask the locations with the top 50% highest values (refer to Table A.3 in the Supp. D for more details). Based on our analysis that the modality dominance between video and audio is relatively balanced (see Figure A.1 in the Supp. D), we set αv and αa to be equal. To determine their optimal values, we randomly select 100 samples from each dataset and vary the value from 0.5 to 3.0 in increments of 0.5. As a result, we set it to 2.5 for the AVHBench dataset, and to 0.5 for all other datasets. Furthermore, we set the entropy threshold to τ = 0.6 for entropy-guided adaptive decoding. |
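
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The numeric values (top 50% masking, the 0.5 to 3.0 sweep in steps of 0.5, α = 2.5 for AVHBench and 0.5 otherwise, τ = 0.6) come from the row above; everything else is an assumption made for illustration. In particular, the function names are hypothetical, the entropy is computed in nats without normalization, and the gate direction (run the contrastive step only when entropy exceeds τ) is assumed, not taken from the paper. This is not the authors' implementation; see the linked repository for that.

```python
import math

# Values quoted in the Experiment Setup row.
MASK_RATIO = 0.5                      # mask the top 50% highest attention values
ENTROPY_THRESHOLD = 0.6               # tau for entropy-guided adaptive decoding
ALPHA_BY_DATASET = {"AVHBench": 2.5}  # alpha_v = alpha_a; other datasets use 0.5
DEFAULT_ALPHA = 0.5


def alpha_grid(start=0.5, stop=3.0, step=0.5):
    """Candidate alpha values: 0.5 to 3.0 in increments of 0.5."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def alpha_for(dataset):
    """Shared value of alpha_v and alpha_a for a dataset (hypothetical helper)."""
    return ALPHA_BY_DATASET.get(dataset, DEFAULT_ALPHA)


def top_ratio_mask(attention, ratio=MASK_RATIO):
    """Return booleans, True where a value is among the top `ratio` fraction."""
    k = int(len(attention) * ratio)
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    masked = set(ranked[:k])
    return [i in masked for i in range(len(attention))]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats (ASSUMPTION: the paper may normalize differently)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_contrastive_step(logits, tau=ENTROPY_THRESHOLD):
    """ASSUMPTION: the contrastive step runs only when entropy exceeds tau."""
    return entropy(softmax(logits)) > tau
```

Under these assumptions, a sharply peaked next-token distribution would skip the contrastive step, while a flat one would trigger it.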