Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Authors: Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, Emanuele Rodolà

ICLR 2024

Reproducibility variables, assessed results, and supporting excerpts (LLM responses):
Research Type: Experimental
Evidence: "We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting." and "5 EXPERIMENTAL RESULTS: We experiment on Slakh2100 (Manilow et al., 2019), a standard dataset for music source separation."
Researcher Affiliation: Academia
Evidence:
- Giorgio Mariani, Sapienza University of Rome (mariani@di.uniroma1.it)
- Irene Tallini, Sapienza University of Rome (tallini@di.uniroma1.it)
- Emilian Postolache, Sapienza University of Rome (postolache@di.uniroma1.it)
- Michele Mancusi, Sapienza University of Rome (mancusi@di.uniroma1.it)
- Luca Cosmo, Ca' Foscari University of Venice (luca.cosmo@unive.it)
- Emanuele Rodolà, Sapienza University of Rome (rodola@di.uniroma1.it)
Pseudocode: Yes
Evidence:
Algorithm 1 MSDM Dirac sampler for source separation.
Require: I number of discretization steps for the ODE, R number of corrector steps, {σ_i}, i ∈ {0, ..., I} noise schedule, S_churn
 1: Initialize x̂ ~ N(0, σ_I² I)
 2: α ← min(S_churn/I, √2 − 1)
 3: for i = I to 1 do
 4:   for r = R to 0 do
 5:     σ̂ ← σ_i (α + 1)
 6:     ε ~ N(0, I)
 7:     x̂ ← x̂ + √(σ̂² − σ_i²) ε
 8:     z ← [x̂_{1:N−1}, y − Σ_{n=1}^{N−1} x̂_n]
 9:     for n = 1 to N − 1 do
10:       g_n ← S_n^θ(z, σ̂) − S_N^θ(z, σ̂)
11:     end for
12:     g ← [g_1, ..., g_{N−1}]
13:     x̂_{1:N−1} ← x̂_{1:N−1} + (σ_{i−1} − σ̂) g
14:     x̂ ← [x̂_{1:N−1}, y − Σ_{n=1}^{N−1} x̂_n]
15:     if r > 0 then
16:       ε ~ N(0, I)
17:       x̂ ← x̂ + √(σ_i² − σ_{i−1}²) ε
18:     end if
19:   end for
20: end for
21: return x̂
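A minimal NumPy sketch of the sampler above. The `score` callable stands in for the trained score network S^θ, and the function name, array shapes, and the zero-score stub used in the usage note are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def dirac_sampler(y, score, N, sigmas, R=0, s_churn=0.0, seed=0):
    """Sketch of the MSDM Dirac sampler for source separation.

    y      -- mixture waveform, shape (T,)
    score  -- callable score(z, sigma) -> (N, T): one score estimate per source
    N      -- number of sources (stems)
    sigmas -- noise levels sigma_0 < ... < sigma_I (length I + 1)
    R      -- number of corrector steps per discretization step
    """
    rng = np.random.default_rng(seed)
    I = len(sigmas) - 1
    T = y.shape[0]
    x = rng.normal(0.0, sigmas[I], size=(N, T))        # x ~ N(0, sigma_I^2 I)
    alpha = min(s_churn / I, np.sqrt(2.0) - 1.0)
    for i in range(I, 0, -1):
        for r in range(R, -1, -1):
            sigma_hat = sigmas[i] * (alpha + 1.0)
            # Churn: raise the noise level from sigma_i to sigma_hat.
            x = x + np.sqrt(sigma_hat**2 - sigmas[i]**2) * rng.normal(size=(N, T))
            # Constrain the last stem so all stems sum to the mixture y.
            z = np.concatenate([x[:N - 1], (y - x[:N - 1].sum(axis=0))[None]])
            s = score(z, sigma_hat)
            g = s[:N - 1] - s[N - 1:N]                  # g_n = S_n - S_N
            x = z.copy()
            x[:N - 1] += (sigmas[i - 1] - sigma_hat) * g
            x[N - 1] = y - x[:N - 1].sum(axis=0)        # re-impose the constraint
            if r > 0:                                   # corrector: re-inject noise
                x = x + np.sqrt(sigmas[i]**2 - sigmas[i - 1]**2) * rng.normal(size=(N, T))
    return x
```

With a zero-score stub the sampler reduces to repeatedly enforcing the mixture constraint, so the returned stems sum exactly to `y`, which makes the sketch easy to sanity-check.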
Open Source Code: Yes
Evidence: "The implementation of the score network is based on a time domain (non-latent) unconditional version of Moûsai (Schneider et al., 2023). We used the publicly available repository audio-diffusion-pytorch/v0.0.43."
Open Datasets: Yes
Evidence: "We experiment on Slakh2100 (Manilow et al., 2019), a standard dataset for music source separation." and "In Appendix E, we experiment on MUSDB18-HQ (Rafii et al., 2019), a benchmark dataset for the music source separation task."
Dataset Splits: Yes
Evidence: "The dataset comprises 2100 tracks, with a distribution of 1500 tracks for training, 375 for validation, and 225 for testing." and, for MUSDB18-HQ, "It contains 150 tracks, with 100 allocated for training and 50 for testing, amounting to roughly 10 hours of professional-grade audio."
Hardware Specification: Yes
Evidence: "All our models were trained until convergence on an NVIDIA RTX A6000 GPU with 24 GB of VRAM."
Software Dependencies: Yes
Evidence: "We used the publicly available repository audio-diffusion-pytorch/v0.0.43." and "We employ audio-diffusion-pytorch-trainer for training."
Experiment Setup: Yes
Evidence: "We downsample data to 22 kHz and train the score network with four stacked mono channels for MSDM (i.e., one for each stem) and one mono channel for each model in ISDM, using a context length of 12 seconds.", "We trained all our models using Adam (Kingma & Ba, 2015), with a learning rate of 10⁻⁴, β1 = 0.9, β2 = 0.99, and a batch size of 16.", and "We set σ_min = 10⁻⁴, σ_max = 1, ρ = 7."
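The reported σ_min, σ_max, and ρ values match the noise-schedule parameterization of Karras et al. (2022) that the audio-diffusion-pytorch stack builds on. A sketch of that schedule under this assumption (the function name and step count are illustrative):

```python
import numpy as np

def karras_sigmas(num_steps, sigma_min=1e-4, sigma_max=1.0, rho=7.0):
    """Karras et al. (2022) schedule, decreasing from sigma_max to sigma_min:
    sigma_i = (sigma_max^(1/rho) + i/(I-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho
    """
    ramp = np.arange(num_steps) / (num_steps - 1)
    return (sigma_max ** (1 / rho)
            + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
```

Larger ρ concentrates more of the steps near σ_min, which is where most of the fine audio detail is resolved during sampling.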