Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... Our experimental results demonstrate that AGAV-Rater achieves state-of-the-art performance on three quality assessment datasets: AGAVQA-MOS, text-to-audio (TTA), and text-to-music (TTM) (Deshmukh et al., 2024). Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience.
Researcher Affiliation	Academia	(1) Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai; (2) School of Communication & Electronic Engineering, East China Normal University, Shanghai. Correspondence to: Guangtao Zhai <EMAIL>, Xiongkuo Min <EMAIL>.
Pseudocode	No	The paper describes the model architecture and training process with figures (e.g., Figure 3), but it does not include a dedicated pseudocode block or algorithm section.
Open Source Code	Yes	The dataset and code are available at https://github.com/charlotte9524/AGAVRater.
Open Datasets	Yes	To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. ... The dataset and code are available at https://github.com/charlotte9524/AGAVRater. ... we create 50,952 instruction-response pairs related to the perceived quality from 3 large-scale real-world audio-caption datasets, including audio-visual datasets VGGSound (Chen et al., 2020), audio captioning dataset AudioCaps (Kim et al., 2019), and music captioning dataset MusicCaps (Agostinelli et al., 2023).
Dataset Splits	Yes	All experiments for each method are retrained on the AGAVQA-MOS subset using 5-fold cross-validation.
Hardware Specification	Yes	The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. ... Fine-tuning the AGAV-Rater model on the AGAVQA-MOS subset for 5 epochs using two 96GB H20 GPUs takes approximately 5 hours. ... In Tab. 9, we report the inference latency of AGAV-Rater on AGAVs. On a single RTX 4090 GPU, the model can predict scores for 6.36 videos of 3 seconds or 3.01 videos of 12 seconds per second.
Software Dependencies	No	The AGAV-Rater model is implemented with PyTorch and trained on two 96GB H20 GPUs. The learning rate is set to 1e-5, and the batch size is set to 9.
Experiment Setup	Yes	The learning rate is set to 1e-5, and the batch size is set to 9. During pre-training, the number of training epochs is set to 1, and optimization is performed. For fine-tuning, the number of training epochs is set to 5 on the AGAVQA-MOS subset and 10 on the TTA and TTM datasets (Deshmukh et al., 2024).
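
Illustrative sketch: The Dataset Splits and Experiment Setup rows can be read together as a small training configuration plus a 5-fold split. The Python below is a minimal sketch of that reading, not the authors' code. The hyperparameter values are those quoted in the table (with the learning rate read as 1e-5); the names `TrainConfig` and `k_fold_indices`, the shuffling seed, and the item count of 100 are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row."""
    learning_rate: float = 1e-5
    batch_size: int = 9
    pretrain_epochs: int = 1
    finetune_epochs_agavqa_mos: int = 5
    finetune_epochs_tta_ttm: int = 10


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Hypothetical helper: the shuffling and seed are assumptions,
    since the table does not say how the folds were formed.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


config = TrainConfig()
# 100 is a placeholder item count, not the size of AGAVQA-MOS.
splits = list(k_fold_indices(n_items=100, k=5, seed=0))
```

Each item appears in exactly one test fold, so averaging a metric over the five folds uses every item once for evaluation.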