Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
Authors: Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events. Following prior works [14, 13, 12], we evaluate generation quality using Fréchet Distance (FD), Fréchet Audio Distance (FAD), Inception Score (IS), KL divergence, and audio-video alignment accuracy. |
| Researcher Affiliation | Academia | ¹KAIST, South Korea; ²NWPU, China; ³UNIST, South Korea. (zhangkang,trungpx,syl4356,joonson)@kaist.ac.kr, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methods and components (Flow-Based Denoising Transformer, Dual-Role Audio-Visual Encoder, Audio Model-Guidance) in narrative text and uses mathematical formulations (Eq. 1-7) along with a diagram (Figure 2) to illustrate the framework. No explicit pseudocode or algorithm block is present. |
| Open Source Code | Yes | Code is available at: https://github.com/pantheon5100/mgaudio |
| Open Datasets | Yes | We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events. |
| Dataset Splits | Yes | We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events. |
| Hardware Specification | Yes | All experiments are run on a single A100 (80GB). |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify version numbers for any key software components like Python, PyTorch, or CUDA libraries. |
| Experiment Setup | Yes | MGAudio is trained for 1.1M steps with a batch size of 64, learning rate of 1e-4, and guidance scale w = 1.45. For all experiments we use a sampling step of 50 and a CFG value of 1.45. All models are optimized using AdamW [43] with a weight decay of 0 and betas (0.9, 0.999). |
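
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration object, which also makes the implied training budget easy to check. The following is only a sketch: the class name `TrainingSetup`, its field names, and the helper methods are hypothetical and are not taken from the MGAudio repository. The training-set size uses the rounded "182k" figure quoted above, so the epoch count is approximate. The keyword names returned by `adamw_kwargs` match those of `torch.optim.AdamW`, but the use of PyTorch is itself an assumption, since the table notes that software dependencies are not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the values reported in the table above."""

    train_steps: int = 1_100_000      # "1.1M steps"
    batch_size: int = 64
    learning_rate: float = 1e-4
    guidance_scale: float = 1.45      # reported as both "w" and "CFG value"
    sampling_steps: int = 50
    weight_decay: float = 0.0
    betas: tuple[float, float] = (0.9, 0.999)
    train_clips: int = 182_000        # rounded "182k" VGGSound training clips

    def samples_seen(self) -> int:
        """Total training examples processed across all steps."""
        return self.train_steps * self.batch_size

    def approx_epochs(self) -> float:
        """Rough number of passes over the training set (uses rounded size)."""
        return self.samples_seen() / self.train_clips

    def adamw_kwargs(self) -> dict:
        """Optimizer settings, keyed as torch.optim.AdamW would expect them.

        Assumption: PyTorch is not confirmed by the source; adapt the keys
        if a different framework is used.
        """
        return {
            "lr": self.learning_rate,
            "betas": self.betas,
            "weight_decay": self.weight_decay,
        }


if __name__ == "__main__":
    setup = TrainingSetup()
    print(f"samples seen: {setup.samples_seen():,}")
    print(f"approx. epochs: {setup.approx_epochs():.1f}")
    print(f"optimizer kwargs: {setup.adamw_kwargs()}")
```

Under these figures, training processes 1.1M × 64 = 70.4M examples, which corresponds to several hundred passes over the roughly 182k training clips.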