Efficient Neural Music Generation

Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. We evaluate the performance of MeLoDy by comparing it to MusicLM [5] and Noise2Music [6].
Researcher Affiliation | Industry | Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang; Speech, Audio & Music Intelligence (SAMI), ByteDance
Pseudocode | No | Not found. The paper states: 'For more details of generation, we present the corresponding algorithms in Appendix C.' However, the algorithms themselves are not present in the provided text.
Open Source Code | No | Not found. The paper provides a link to a demo page for samples ('Our samples are available at https://Efficient-MeLoDy.github.io/'), but does not state that the source code for the methodology is openly available, nor does it link to a code repository.
Open Datasets | No | Not found. The paper states: 'MeLoDy was trained on 257k hours of music data (6.4M 24 kHz audios), which were filtered with [27] to focus on non-vocal music.' It also mentions filtering 'in-house music data' (footnote 1). While it uses public tools and a public dataset (MusicCaps) for evaluation, the primary training dataset is described as 'in-house' with no public access information provided.
Dataset Splits | No | Not found. The paper describes 'training inputs' and test evaluation, but does not explicitly provide details of a validation split, such as percentages, counts, or a defined splitting methodology.
Hardware Specification | Yes | The speed and the quality of our proposed MeLoDy on a CPU (Intel Xeon Platinum 8260 CPU @ 2.40GHz) or a GPU (NVIDIA Tesla V100) using different numbers of sampling steps.
Software Dependencies | No | Not found. The paper mentions various software components and models (e.g., LLaMA, Wav2Vec2-Conformer, ChatGPT, HiFi-GAN, and PyTorch via its framework usage), but does not provide specific version numbers for any of them.
Experiment Setup | Yes | Semantic LM: For semantic modeling, we trained a 429.5M LLaMA [71] with 24 layers, 8 heads, and 2048 hidden dimensions... For conditioning, we set up the MuLan RVQ using 12 1024-sized codebooks... The training targets were 10s semantic tokens... 199.5M Wav2Vec2-Conformer with 1024-center k-means. Dual-Path Diffusion: For the DPD model, we set the hidden dimension to D_hid = 768 and the block number to N = 8, resulting in 296.6M parameters. For the input chunking strategy, we divide the 10s training inputs of fixed length L = 2500 into M = 4 parts. For segmentation, we used a segment size of K = 64... For sampling, the unconditional prediction v_uncond and the conditional prediction v_cond are linearly combined as γ · v_cond + (1 − γ) · v_uncond with a scale of γ = 2.5. Audio VAE-GAN: For the audio VAE-GAN, we used a hop size of 96, resulting in 250 Hz latent sequences for encoding 24 kHz music audio. The latent dimension is D = 16; thus we have a total compression rate of 6×. The hidden channels used in the encoder were 256, whereas those used in the decoder were 768.
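Below is a minimal PyTorch sketch of the sampling rule and latent geometry quoted in the Experiment Setup row. The `dpd_model` name and its call signature are hypothetical placeholders (the authors' code is not released); only the γ-combination and the reported dimensions come from the paper.

```python
import torch

def guided_velocity(dpd_model, z_t, t, cond, gamma: float = 2.5) -> torch.Tensor:
    """Guidance rule as quoted: gamma * v_cond + (1 - gamma) * v_uncond."""
    v_cond = dpd_model(z_t, t, cond)    # prediction with conditioning
    v_uncond = dpd_model(z_t, t, None)  # prediction with conditioning dropped
    return gamma * v_cond + (1.0 - gamma) * v_uncond

# Reported latent geometry: 10 s of 24 kHz audio at a 96-sample hop gives
# 250 Hz latent frames of dimension D = 16, i.e. L = 2500 frames per input,
# chunked into M = 4 parts; compression rate = 24000 / (250 * 16) = 6x.
L, M, D = 2500, 4, 16
z = torch.randn(1, L, D)        # one 10 s latent sequence (batch, frames, dim)
chunks = z.chunk(M, dim=1)      # four 625-frame chunks
assert all(c.shape == (1, L // M, D) for c in chunks)
```

Note that γ · v_cond + (1 − γ) · v_uncond rearranges to v_uncond + γ · (v_cond − v_uncond), i.e. the standard classifier-free guidance form with scale γ.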