Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Authors: Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events. Following prior works [14, 13, 12], we evaluate generation quality using Fréchet Distance (FD), Fréchet Audio Distance (FAD), Inception Score (IS), KL divergence, and audio-video alignment accuracy.
Researcher Affiliation | Academia | 1 KAIST, South Korea; 2 NWPU, China; 3 UNIST, South Korea. (zhangkang,trungpx,syl4356,joonson)@kaist.ac.kr, EMAIL, EMAIL
Pseudocode | No | The paper describes its methods and components (Flow-Based Denoising Transformer, Dual-Role Audio-Visual Encoder, Audio Model-Guidance) in narrative text and uses mathematical formulations (Eq. 1-7) along with a diagram (Figure 2) to illustrate the framework. No explicit pseudocode or algorithm block is present.
Open Source Code | Yes | Code is available at: https://github.com/pantheon5100/mgaudio
Open Datasets | Yes | We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events.
Dataset Splits | Yes | We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events.
Hardware Specification | Yes | All experiments are run on a single A100 (80GB).
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any key software components like Python, PyTorch, or CUDA libraries.
Experiment Setup | Yes | MGAudio is trained for 1.1M steps with a batch size of 64, learning rate of 1e-4, and guidance scale w = 1.45. For all experiments, we use 50 sampling steps and a CFG value of 1.45. All models are optimized using AdamW [43] with a weight decay of 0 and betas (0.9, 0.999).
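
To make the reported setup easier to scan, the values quoted in the Experiment Setup and Dataset Splits rows can be gathered into a single record. The sketch below is only an illustration: the class name `ReportedSetup` and all field names are hypothetical and are not taken from the paper or its repository. The epoch figure is a rough derived estimate, since the source rounds the training set size to 182k clips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all names here are hypothetical."""
    train_steps: int = 1_100_000   # "1.1M steps"
    batch_size: int = 64
    learning_rate: float = 1e-4
    guidance_scale: float = 1.45   # w, also quoted as the CFG value
    sampling_steps: int = 50
    weight_decay: float = 0.0      # AdamW weight decay
    betas: tuple = (0.9, 0.999)    # AdamW betas
    train_clips: int = 182_000     # "182k" VGGSound training clips (rounded in the source)

    def samples_seen(self) -> int:
        # Total training samples processed: steps times batch size.
        return self.train_steps * self.batch_size

    def approx_epochs(self) -> float:
        # Rough number of passes over the training set (assumes exactly 182k clips).
        return self.samples_seen() / self.train_clips


setup = ReportedSetup()
print(setup.samples_seen())
print(round(setup.approx_epochs(), 1))
```

Under these assumptions, the run processes about 70.4 million samples, or roughly 387 passes over the VGGSound training split.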