Efficient Neural Music Generation
Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. We evaluate the performance of MeLoDy by comparing it to MusicLM [5] and Noise2Music [6]. |
| Researcher Affiliation | Industry | Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang. Speech, Audio & Music Intelligence (SAMI), ByteDance |
| Pseudocode | No | Not found. The paper states: 'For more details of generation, we present the corresponding algorithms in Appendix C.' However, the algorithms themselves are not present in the provided text. |
| Open Source Code | No | Not found. The paper provides a link to a demo page for samples ('Our samples are available at https://Efficient-MeLoDy.github.io/'), but does not explicitly state that the source code for the methodology is openly available or provide a link to a code repository. |
| Open Datasets | No | Not found. The paper states: 'MeLoDy was trained on 257k hours of music data (6.4M 24kHz audios), which were filtered with [27] to focus on non-vocal music.' It also mentions filtering 'in-house music data' (footnote 1). While it uses public tools and public datasets for *evaluation* (MusicCaps), the primary training dataset is described as 'in-house' with no public access information provided. |
| Dataset Splits | No | Not found. The paper describes 'training inputs' and 'test' evaluation, but does not explicitly provide details of a validation split with specific percentages, counts, or a defined splitting methodology. |
| Hardware Specification | Yes | The speed and the quality of our proposed MeLoDy on a CPU (Intel Xeon Platinum 8260 CPU @ 2.40GHz) or a GPU (NVIDIA Tesla V100) using different numbers of sampling steps. |
| Software Dependencies | No | Not found. The paper mentions various software components and models (e.g., LLaMA, Wav2Vec2-Conformer, ChatGPT, HiFi-GAN, PyTorch via its framework usage), but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | Semantic LM: For semantic modeling, we trained a 429.5M LLaMA [71] with 24 layers, 8 heads, and 2048 hidden dimensions... For conditioning, we set up the MuLan RVQ using 12 1024-sized codebooks... The training targets were 10s semantic tokens... 199.5M Wav2Vec2-Conformer with 1024-center k-means. Dual-Path Diffusion: For the DPD model, we set the hidden dimension to D_hid = 768 and the block number to N = 8, resulting in 296.6M parameters. For the input chunking strategy, we divide the 10s training inputs of a fixed length L = 2500 into M = 4 parts. For segmentation, we used a segment size of K = 64... For sampling, the unconditional prediction v_uncond and the conditional prediction v_cond are linearly combined: ω·v_cond + (1 − ω)·v_uncond with a scale of ω = 2.5. Audio VAE-GAN: For the audio VAE-GAN, we used a hop size of 96, resulting in 250Hz latent sequences for encoding 24kHz music audio. The latent dimension D = 16, thus we have a total compression rate of 6×. The hidden channels used in the encoder were 256, whereas those used in the decoder were 768. |
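
The classifier-free guidance step quoted in the Experiment Setup row is compact enough to sketch. Below is a minimal, hypothetical Python/NumPy illustration of the linear combination ω·v_cond + (1 − ω)·v_uncond with ω = 2.5; the function name, array shapes, and toy inputs are assumptions for illustration, since the authors' code is not publicly released.

```python
import numpy as np

def guided_prediction(v_cond: np.ndarray,
                      v_uncond: np.ndarray,
                      omega: float = 2.5) -> np.ndarray:
    """Classifier-free guidance as quoted from the paper: combine the
    conditional and unconditional predictions as
    omega * v_cond + (1 - omega) * v_uncond, with scale omega = 2.5."""
    return omega * v_cond + (1.0 - omega) * v_uncond

# Toy usage. Shapes follow the setup described above: 10s of audio at the
# VAE-GAN's 250Hz latent rate gives L = 2500 frames of dimension D = 16
# (hop size 96 on 24kHz audio: 24000 / 96 = 250Hz; compression 96 / 16 = 6x).
v_cond = np.random.randn(2500, 16)
v_uncond = np.random.randn(2500, 16)
v_hat = guided_prediction(v_cond, v_uncond)  # shape (2500, 16)
```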