SongCreator: Lyrics-based Universal Song Generation

Authors: Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks. |
| Researcher Affiliation | Collaboration | 1 Shenzhen International Graduate School, Tsinghua University, Shenzhen; 2 Independent Researcher; 3 The Chinese University of Hong Kong, Hong Kong SAR. Contact: {leis21, yx-zhou23}@mails.tsinghua.edu.cn, zywu@sz.tsinghua.edu.cn |
| Pseudocode | No | The paper describes the system architecture and process in figures and text but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | We are committed to advancing the field responsibly, and therefore, the checkpoints trained on the full dataset will not be released. |
| Open Datasets | Yes | We collected approximately 8,500 hours of songs with lyrics from the internet for model training, comprising part of the DISCO-10M [69] dataset and some in-house datasets. |
| Dataset Splits | No | The paper states that the DSLM is trained on 8,500 hours of song data split into 1.7M clips and that some experiments use a held-out set, but it does not provide specific percentages or counts for the training, validation, and test splits used in the main experiments. |
| Hardware Specification | Yes | During training, we train the DSLM for 500K steps using 8 NVIDIA A800 GPUs, with a batch size of 8 for each GPU. |
| Software Dependencies | No | The paper mentions various open-source libraries and models such as BEST-RQ, Demucs, and GPT, along with their GitHub links, but does not specify exact version numbers for these dependencies (e.g., "PyTorch 1.x" or "Demucs vX.Y"). |
| Experiment Setup | Yes | During training, we train the DSLM for 500K steps using 8 NVIDIA A800 GPUs, with a batch size of 8 for each GPU. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹, following the same learning rate schedule as in [66]. Consistently, top-k sampling is adopted for inference, in which k and temperature are set to 50 and 0.9, respectively. |
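
The reported Experiment Setup can be summarized in a short sketch. The snippet below is illustrative only: the optimizer hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-9), the 500K-step budget, and the top-k sampling settings (k = 50, temperature = 0.9) are taken from the row above, while the placeholder model, the `inverse_sqrt_warmup` schedule standing in for the schedule cited as [66], and all other names are assumptions, not the authors' (unreleased) code.

```python
# Sketch of the reported training/inference hyperparameters.
# Quoted from the paper: Adam(beta1=0.9, beta2=0.98, eps=1e-9), 500K steps,
# top-k sampling with k=50 and temperature=0.9.
# Assumed: the placeholder model and the warmup schedule below.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024)  # placeholder standing in for the unreleased DSLM

# Adam as reported; base lr = 1.0 so the LambdaLR factor below is the effective rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

def inverse_sqrt_warmup(step, d_model=1024, warmup=4000):
    """Assumed stand-in for the learning-rate schedule cited as [66]:
    linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt_warmup)

def sample_top_k(logits, k=50, temperature=0.9):
    """Top-k sampling at inference with k=50 and temperature=0.9, as reported."""
    scaled = logits / temperature
    top_vals, top_idx = torch.topk(scaled, k, dim=-1)   # keep the k largest logits
    probs = F.softmax(top_vals, dim=-1)                  # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)     # draw one token per sequence
    return top_idx.gather(-1, choice)
```

With k = 50 and temperature = 0.9, sampling is restricted to the 50 most likely tokens after mildly sharpening the distribution, matching the inference setting quoted in the Experiment Setup row; the 8×A800, batch-size-8-per-GPU training configuration is a distributed detail omitted from this sketch.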