Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

Authors: Yunkee Chae, Kyogu Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we outline our experimental protocol, including baseline models, datasets, and key implementation details. A comprehensive description of hyperparameters and training procedures is provided in Appendix D. All baseline results are re-evaluated using our test sets to ensure consistency with our experimental setup.
Researcher Affiliation | Academia | Yunkee Chae 1, Kyogu Lee 1,2,3. Music and Audio Research Group (MARG) 1 Interdisciplinary Program in Artificial Intelligence (IPAI) 2 AIIS, 3 Department of Intelligence and Information, Seoul National University. EMAIL
Pseudocode | Yes | Algorithm 1 Inpainting using the RePaint approach. Algorithm 2 Inpainting using adaptive timestep approach.
Open Source Code | Yes | We released the full training code and all model checkpoints under an open-source license.
Open Datasets | Yes | We train and evaluate on three multi-track music datasets: Slakh2100 [37], MUSDB18 [38] (denoted Mu), and MoisesDB [39] (denoted Mo).
Dataset Splits | Yes | Slakh2100... comprises 2100 songs divided into training (1500), validation (375), and test (225) splits... MUSDB18... use all 100 tracks from the official training split for training and the 50-track test split for evaluation... MoisesDB... randomly sample 24 tracks (10%) as the test set and use the remaining tracks for training.
Hardware Specification | Yes | All models were trained on a single NVIDIA RTX 6000 GPU (48 GB of memory).
Software Dependencies | Yes | Text embeddings are obtained from CLAP [56] using the checkpoint music_audioset_epoch_15_esc_90.14.pt via the laion-clap library. Our implementation builds upon the official stable-audio-tools repository from Stability AI and the training framework from friendly-stable-audio-tools.
Experiment Setup | Yes | All of our models, except the one trained on the full dataset combination (SA, SB, Mu, Mo), are trained for 200K iterations with a batch size of 64, using 16 kHz audio segments of 10.24 seconds. The full combination model is trained for 320K iterations with a batch size of 128. During sampling and inpainting, we apply classifier-free guidance (CFG) with a guidance scale of 2.0 and a per-track dropout probability of p = 0.1. All diffusion-based samples, including those from baseline models, are generated using 250 inference steps.
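
The experiment setup quoted above implies a couple of concrete quantities that are easy to check by hand. The following is a minimal Python sketch, not the authors' code: the function names `segment_num_samples` and `cfg_combine` are hypothetical, and the guidance formula is the commonly used classifier-free guidance combination, assumed here for illustration; the paper's exact formulation may differ.

```python
# Illustrative sketch only; names and the CFG formula are assumptions,
# not taken from the MGE-LDM code base.

SAMPLE_RATE_HZ = 16_000   # "16 kHz audio" from the reported setup
SEGMENT_SECONDS = 10.24   # segment duration from the reported setup
GUIDANCE_SCALE = 2.0      # CFG guidance scale from the reported setup


def segment_num_samples(sample_rate_hz=SAMPLE_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Number of audio samples in one training segment."""
    # round() guards against floating-point error in the product.
    return round(sample_rate_hz * seconds)


def cfg_combine(cond, uncond, scale=GUIDANCE_SCALE):
    """Combine conditional and unconditional model outputs element-wise.

    Uses the common form: uncond + scale * (cond - uncond).
    With scale == 1.0 this returns the conditional output unchanged.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    print(segment_num_samples())                   # 163840
    print(cfg_combine([1.0, 2.0], [0.0, 1.0]))     # [2.0, 3.0]
```

Under these assumptions, each 10.24-second segment at 16 kHz contains 163,840 samples, and a guidance scale of 2.0 pushes the output twice as far from the unconditional prediction as the conditional prediction alone would.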