InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models
Authors: Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Experimental results on public and private datasets demonstrate that InstructME outperforms previous systems. |
| Researcher Affiliation | Collaboration | Bing Han¹, Junyu Dai², Weituo Hao², Xinyan He², Dong Guo², Jitong Chen², Yuxuan Wang², Yanmin Qian¹ and Xuchen Song²; ¹Auditory Cognition and Computational Acoustics Lab, Shanghai Jiao Tong University; ²ByteDance |
| Pseudocode | No | The paper describes the model architecture and processes but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Demo samples are available at https://musicedit.github.io/ |
| Open Datasets | Yes | To study the generalization ability of InstructME, we also test it on the publicly available dataset Slakh [Manilow et al., 2019], which is a dataset of multitrack audio and has no overlap with the training data. |
| Dataset Splits | No | We split the in-house data randomly into two parts and use one subset to generate triplet data for evaluating the models. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments. |
| Software Dependencies | No | For each text instruction y, a pretrained T5 [Raffel et al., 2020] converts it into a sequence of embeddings... and ...the audio classification model is implemented with VGGish [Hershey et al., 2017]. |
| Experiment Setup | Yes | For model optimization, we use the reweighted bound [Ho et al., 2020; Rombach et al., 2022] as the objective function: $L_{DM} = \mathbb{E}_{\epsilon,t,z_0}\,\lVert \epsilon - \epsilon_\theta(t, T(y), z_s, z_t) \rVert_2^2$ (Eq. 2), with t uniformly sampled from [1, T] during training. The process involves three steps: segmentation of T-frame embeddings into K-frame chunks with 50% overlap, individual chunk processing through a transformer layer, and fusion to merge the overlapping output chunks back into T frames by addition. Classifier-free guidance is used, where w determines the strength of guidance and the factor s controls the guidance scale. |
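
The objective in Eq. (2) is the standard noise-prediction loss of Ho et al. (2020). The following is a minimal PyTorch-style sketch of one training step, not the authors' code: the tensor shapes, the noise schedule, and the condition arguments (T5 text embeddings and the source-music latent) are assumptions about the interface.

```python
import torch

def diffusion_loss(model, z0, t, cond_text, z_source, alphas_cumprod):
    """Reweighted bound (eps-prediction MSE), as in Eq. (2).

    z0            : clean latent of the target music segment
    t             : timestep indices sampled uniformly from [1, T]
    cond_text     : T5 embeddings of the instruction y (assumed interface)
    z_source      : latent of the source music to be edited (assumed interface)
    alphas_cumprod: cumulative product of the noise schedule
    """
    eps = torch.randn_like(z0)                          # Gaussian noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1)            # broadcast over latent dims
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps  # forward diffusion q(z_t | z_0)
    eps_hat = model(t, cond_text, z_source, z_t)        # predict the added noise
    return torch.mean((eps - eps_hat) ** 2)             # L_DM = E ||eps - eps_theta||^2
```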
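The three-step chunk processing quoted in the Experiment Setup row (segment into 50%-overlapping K-frame chunks, run each chunk through a transformer layer, fuse the outputs by addition) can be sketched roughly as below; the module choice, sizes, and padding behavior are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def chunked_transform(x, layer, chunk_frames):
    """Process a (batch, T, dim) sequence in K-frame chunks with 50% overlap,
    then fuse the overlapping outputs back into T frames by addition."""
    B, T, D = x.shape
    hop = chunk_frames // 2                                   # 50% overlap
    out = torch.zeros_like(x)
    for start in range(0, max(T - chunk_frames, 0) + 1, hop):
        chunk = x[:, start:start + chunk_frames]              # one K-frame chunk
        out[:, start:start + chunk_frames] += layer(chunk)    # fuse by addition
    return out

# Usage with a standard transformer encoder layer (hypothetical sizes):
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
x = torch.randn(2, 400, 256)                       # T = 400 frames of embeddings
y = chunked_transform(x, layer, chunk_frames=100)  # K = 100
```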