InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models

Authors: Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Experimental results on public and private datasets demonstrate that InstructME outperforms the previous system.
Researcher Affiliation | Collaboration | Bing Han¹, Junyu Dai², Weituo Hao², Xinyan He², Dong Guo², Jitong Chen², Yuxuan Wang², Yanmin Qian¹ and Xuchen Song²; ¹Auditory Cognition and Computational Acoustics Lab, Shanghai Jiao Tong University; ²ByteDance
Pseudocode | No | The paper describes the model architecture and processes but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | Demo samples are available at https://musicedit.github.io/
Open Datasets | Yes | To study the generalization ability of InstructME, we also test it on the publicly available dataset Slakh [Manilow et al., 2019], which is a dataset of multitrack audio and has no overlap with the training data.
Dataset Splits | No | We split the in-house data randomly into two parts and use one subset to generate triplet data for evaluating the models.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments.
Software Dependencies | No | For each text instruction y, a pretrained T5 [Raffel et al., 2020] converts it into a sequence of embeddings... and ...the audio classification model is implemented with VGGish [Hershey et al., 2017].
Experiment Setup | Yes | For model optimization, we use the reweighted bound [Ho et al., 2020; Rombach et al., 2022] as the objective function: $\mathcal{L}_{DM} = \mathbb{E}_{\epsilon, t, z_0}\,\lVert \epsilon - \epsilon_\theta(t, T(y), z_s, z_t) \rVert_2^2$ (Eq. 2), with t uniformly sampled from [1, T] during training. The process involves three steps: segmentation of T-frame embeddings into K-frame chunks with 50% overlap, individual chunk processing through a transformer layer, and fusion to merge the overlapping output chunks into T frames by addition. Guidance is applied with a weight w that determines the strength of guidance and a factor s that controls the guidance scale.
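The Software Dependencies row notes that a pretrained T5 encodes each text instruction y into a sequence of embeddings T(y) used to condition the diffusion model. The excerpt does not name a specific checkpoint or library, so the sketch below is only an illustration that assumes the Hugging Face transformers implementation and the t5-base checkpoint; neither is confirmed by the paper.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Assumed checkpoint: the excerpt only says "a pretrained T5", not which size.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base").eval()

instruction = "add an acoustic guitar to the chorus"  # example edit instruction y

tokens = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    # T(y): the sequence of token embeddings that conditions the diffusion model.
    embeddings = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 768)
```

Only the encoder half of T5 is needed here, since the model consumes the instruction embeddings rather than generating text.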
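The Experiment Setup row describes a chunked transformer: T-frame embeddings are split into K-frame chunks with 50% overlap, each chunk passes through a transformer layer, and the overlapping outputs are fused back into T frames by addition. No reference implementation is provided, so the PyTorch sketch below only illustrates that split/process/overlap-add pattern; the embedding dimension, chunk length K, and layer configuration are placeholder values.

```python
import torch
import torch.nn as nn


class ChunkTransformer(nn.Module):
    """Split a (batch, T, dim) sequence into K-frame chunks with 50% overlap,
    run each chunk through a transformer encoder layer, and overlap-add the
    chunk outputs back into a (batch, T, dim) sequence."""

    def __init__(self, dim: int = 256, chunk_size: int = 64, num_heads: int = 4):
        super().__init__()
        self.chunk_size = chunk_size   # K frames per chunk (placeholder value)
        self.hop = chunk_size // 2     # 50% overlap between consecutive chunks
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, total_frames, dim = x.shape
        fused = torch.zeros_like(x)
        for start in range(0, total_frames, self.hop):
            end = min(start + self.chunk_size, total_frames)
            chunk_out = self.layer(x[:, start:end])  # process one K-frame chunk
            fused[:, start:end] += chunk_out         # merge overlapping outputs by addition
        return fused


# Example: 400 latent frames of dimension 256, processed in 64-frame chunks.
frames = torch.randn(2, 400, 256)
fused = ChunkTransformer(dim=256, chunk_size=64)(frames)  # shape (2, 400, 256)
```

With overlap-add fusion, frames covered by two chunks receive the sum of both chunk outputs; whether InstructME additionally windows or normalizes the overlapped regions is not specified in the excerpt.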