Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Authors: Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation results on AVHBench reveal that current audio-visual LLMs are prone to both audio-driven and video-driven hallucinations. |
| Researcher Affiliation | Academia | Kim Sung-Bin¹, Oh Hyun-Bin¹, Lee Jung-Mok¹, Arda Senocak², Joon Son Chung², Tae-Hyun Oh¹,³,⁴; ¹Dept. of Electrical Engineering and ³Grad. School of Artificial Intelligence, POSTECH; ²School of Electrical Engineering, KAIST; ⁴School of Computing, KAIST |
| Pseudocode | No | The paper describes methods through narrative text and a pipeline diagram (Fig. 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset: https://github.com/kaist-ami/AVHBench |
| Open Datasets | Yes | To address this, we repurpose existing datasets, namely VALOR (Chen et al., 2023b) and AudioCaps (Kim et al., 2019), leveraging their videos and annotations. |
| Dataset Splits | Yes | In total, our comprehensive test and validation sets comprise 1,106 real and 1,030 synthetic source videos. [...] This dataset contains 10,327 videos with 87,624 QnA pairs, collected from the training split of the VALOR (Chen et al., 2023b) and Audiocaps (Kim et al., 2019) datasets. |
| Hardware Specification | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. |
| Software Dependencies | No | The paper mentions using 'mixed precision' (fp16 for multiplication and fp32 for addition) but does not provide specific software dependencies or library versions such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | We utilize 4 A6000 (48GB) for distributed training with batch size 32 per device and an initial learning rate (3e-5) and weight decay (0.05) for 1 epoch. [...] We set the rank and alpha value of LoRA to 16 and 32, respectively. |
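
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object to make the derived quantities explicit. The following is a minimal sketch, not the authors' training code: the class name `ReportedSetup` and its field names are hypothetical, the effective batch size assumes no gradient accumulation (the quoted text does not mention any), the LoRA scaling factor assumes the conventional alpha / rank definition, and the step count assumes each of the 87,624 QnA pairs is one training sample.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hypothetical container for the values quoted in the table above."""

    num_gpus: int = 4                 # "4 A6000 (48GB)"
    per_device_batch_size: int = 32   # "batch size 32 per device"
    learning_rate: float = 3e-5       # "initial learning rate (3e-5)"
    weight_decay: float = 0.05        # "weight decay (0.05)"
    num_epochs: int = 1               # "for 1 epoch"
    lora_rank: int = 16               # "rank ... of LoRA to 16"
    lora_alpha: int = 32              # "alpha value of LoRA to ... 32"

    @property
    def effective_batch_size(self) -> int:
        # Assumption: plain data parallelism with no gradient accumulation.
        return self.num_gpus * self.per_device_batch_size

    @property
    def lora_scaling(self) -> float:
        # Assumption: the conventional LoRA scaling factor alpha / rank.
        return self.lora_alpha / self.lora_rank

    def steps_per_epoch(self, num_samples: int) -> int:
        # Assumption: one optimizer step per global batch, last batch kept.
        return math.ceil(num_samples / self.effective_batch_size)


setup = ReportedSetup()
print(setup.effective_batch_size)     # 128
print(setup.lora_scaling)             # 2.0
print(setup.steps_per_epoch(87_624))  # 685
```

Under these assumptions, the reported settings correspond to a global batch of 128 samples and roughly 685 optimizer steps for the single training epoch. If the original training used gradient accumulation or a different sample definition, these derived numbers would change.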