Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation

Authors: Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events. Following prior works [14, 13, 12], we evaluate generation quality using Fréchet Distance (FD), Fréchet Audio Distance (FAD), Inception Score (IS), KL divergence, and audio-video alignment accuracy.
Researcher Affiliation	Academia	¹KAIST, South Korea; ²NWPU, China; ³UNIST, South Korea; (zhangkang,trungpx,syl4356,joonson)@kaist.ac.kr, EMAIL, EMAIL
Pseudocode	No	The paper describes its methods and components (Flow-Based Denoising Transformer, Dual-Role Audio-Visual Encoder, Audio Model-Guidance) in narrative text and uses mathematical formulations (Eq. 1-7) along with a diagram (Figure 2) to illustrate the framework. No explicit pseudocode or algorithm block is present.
Open Source Code	Yes	Code is available at: https://github.com/pantheon5100/mgaudio
Open Datasets	Yes	We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events.
Dataset Splits	Yes	We train on the VGGSound dataset [1], which contains in-the-wild video clips from YouTube, with 182k for training and 15k for testing. For generalization, we also test on the UnAV-100 dataset [17], which includes 10,791 test videos with annotated sound events.
Hardware Specification	Yes	All experiments are run on a single A100 (80GB).
Software Dependencies	No	The paper mentions using the AdamW optimizer but does not specify version numbers for any key software components such as Python, PyTorch, or CUDA libraries.
Experiment Setup	Yes	MGAudio is trained for 1.1M steps with a batch size of 64, a learning rate of 1e-4, and a guidance scale of w = 1.45. For all experiments, we use 50 sampling steps and a CFG value of 1.45. All models are optimized using AdamW [43] with a weight decay of 0 and betas (0.9, 0.999).
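
For readers who want to carry the Experiment Setup values into their own scripts, the following is a minimal sketch that collects them in one place. The class name `TrainConfig`, its field names, and the helper `apply_guidance` are hypothetical and are not taken from the paper or its repository. The guidance function uses the standard classifier-free guidance mix, which is an assumption here; the paper's own model-guidance formulation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    train_steps: int = 1_100_000      # "1.1M steps"
    batch_size: int = 64
    learning_rate: float = 1e-4
    weight_decay: float = 0.0         # AdamW weight decay
    betas: tuple = (0.9, 0.999)       # AdamW betas
    guidance_scale: float = 1.45      # w, also reported as the CFG value
    sampling_steps: int = 50


def apply_guidance(cond, uncond, scale):
    """Standard classifier-free guidance mix, element by element.

    Assumption: this is the textbook form uncond + scale * (cond - uncond),
    not necessarily the exact rule used by MGAudio.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cfg = TrainConfig()

# Total number of training samples drawn: steps times batch size.
samples_seen = cfg.train_steps * cfg.batch_size  # 70,400,000
```

With a scale of 1.0 the mix returns the conditional prediction unchanged, and values above 1.0, such as the reported 1.45, push the output further away from the unconditional prediction.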