Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... Our experimental results demonstrate that AGAV-Rater achieves state-of-the-art performance on three quality assessment datasets: AGAVQA-MOS, text-to-audio (TTA), and text-to-music (TTM) (Deshmukh et al., 2024). Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. |
| Researcher Affiliation | Academia | 1Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai 2School of Communication & Electronic Engineering, East China Normal University, Shanghai. Correspondence to: Guangtao Zhai <EMAIL>, Xiongkuo Min <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training process with figures (e.g., Figure 3), but it does not include a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | The dataset and code are available at https://github.com/charlotte9524/AGAVRater. |
| Open Datasets | Yes | To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... The dataset and code are available at https://github.com/charlotte9524/AGAVRater. ... we create 50,952 instruction-response pairs related to the perceived quality from 3 large-scale real-world audio-caption datasets, including audio-visual datasets VGGSound (Chen et al., 2020), audio captioning dataset AudioCaps (Kim et al., 2019), and music captioning dataset MusicCaps (Agostinelli et al., 2023). |
| Dataset Splits | Yes | All experiments for each method are retrained on the AGAVQA-MOS subset using 5-fold cross-validation. |
| Hardware Specification | Yes | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. ... Fine-tuning the AGAV-Rater model on the AGAVQA-MOS subset for 5 epochs using two 96GB H20 GPUs takes approximately 5 hours. ... In Tab. 9, we report the inference latency of AGAV-Rater on AGAVs. On a single RTX 4090 GPU, the model can predict scores for 6.36 videos of 3 seconds or 3.01 videos of 12 seconds per second. |
| Software Dependencies | No | The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. The learning rate is set to 1e-5, and the batch size is set to 9. |
| Experiment Setup | Yes | The learning rate is set to 1e-5, and the batch size is set to 9. During pre-training, the number of training epochs is set to 1, and optimization is performed. For fine-tuning, the number of training epochs is set to 5 on the AGAVQA-MOS subset and 10 on the TTA and TTM datasets (Deshmukh et al., 2024). |
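
The Dataset Splits row reports 5-fold cross-validation on the AGAVQA-MOS subset. A minimal sketch of how such folds could be generated is shown here. It is not the authors' splitting code: the function name `k_fold_indices`, the seed, and the round-robin fold assignment are all assumptions made for illustration, and the size of the AGAVQA-MOS subset is not stated in the table, so the item count is left as a parameter.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; the paper's actual split
    procedure is not described in the table above.
    """
    indices = list(range(n_items))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Assign items round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        yield train_idx, test_idx


if __name__ == "__main__":
    # n_items=10 is a toy value, not the size of any dataset in the paper.
    for fold_id, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
        print(fold_id, len(train_idx), len(test_idx))
```

Each item appears in exactly one held-out fold, so every sample is scored once by a model that never saw it during training. Reproducing the reported numbers would still require the authors' own fold assignment, which may be available in the linked repository.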