Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Authors: Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. |
| Researcher Affiliation | Collaboration | 1 Shenzhen International Graduate School, Tsinghua University, Shenzhen; 2 Tencent AI Lab; 3 Wuhan University; 4 The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China; 5 X-LANCE Lab, Shanghai Jiao Tong University, Shanghai; 6 School of Intelligence Science and Technology, Nanjing University, Suzhou, China |
| Pseudocode | No | The paper describes the methodology through prose and architectural diagrams (Figure 1, Figure 2, Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Audio examples and source code are available at https://levo-demo.github.io/ and https://github.com/tencent-ailab/songgeneration. |
| Open Datasets | Yes | We trained our DiT model using a large-scale dataset of 300,000 music samples sourced from the internet, incorporating subsets of DISCO-10M [56], the Million Song Dataset [57], and our own in-house collections. |
| Dataset Splits | Yes | From this dataset, we randomly selected 10,000 music samples to serve as our test set. For each training epoch, we randomly sampled 100,000 audio clips from the full training set, while 200 samples were drawn from the test set for evaluation during that epoch. |
| Hardware Specification | Yes | During training, we used 32 NVIDIA H20 GPUs, with a batch size of 2 for each GPU to train the LeLM for 265K steps. For both the Music Codec (Mixed) and Music Codec (Dual-Track) models, we utilized 8 40GB NVIDIA A100 GPUs for each configuration, training each model's DiT for 300,000 steps. |
| Software Dependencies | No | The paper references various models and architectures (e.g., Qwen2 tokenizer, GPT-2 architecture, Stable Audio, Whisper, wav2vec 2.0) but does not provide specific version numbers for software dependencies or libraries (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | During training, we used 32 NVIDIA H20 GPUs, with a batch size of 2 for each GPU to train the LeLM for 265K steps. Of these, 200k steps are required for the pre-training, with 60k steps for modular extension training, then 5k steps for multi-preference alignment. Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹ and follow the same learning rate schedule in [58]. Consistently, top-k sampling is adopted for inference, in which k and temperature are set to 50 and 0.9, respectively. Table 4: Hyper-parameters of LeLM. Language model: Encoder Layers 28, Hidden Size 1536, Attention Head 12, Feed-Forward Dim 8960, Max Context Length (in #tokens) 8196. Decoder: Layers 12, Hidden Size 1536, Attention Head 12, Feed-Forward Dim 8960, Max Context Length (in #tokens) 10000. |
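
The Experiment Setup entry reports top-k sampling at inference with k = 50 and temperature 0.9. As a rough illustration of what those two settings control, the sketch below draws one token index from a list of logits. It is a generic, standard-library version of top-k sampling and is not taken from the LeVo code base; the function name `top_k_sample` and its arguments are hypothetical.

```python
import math
import random


def top_k_sample(logits, k=50, temperature=0.9, rng=None):
    """Draw one token index from `logits` using top-k filtering and temperature.

    Hypothetical helper for illustration only; the defaults mirror the
    k = 50 and temperature = 0.9 values quoted in the table above.
    """
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    if rng is None:
        rng = random.Random()

    # Keep the indices of the k largest logits (all of them if fewer than k).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Divide by the temperature, then subtract the maximum before
    # exponentiating so that math.exp cannot overflow.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices accepts unnormalized weights and normalizes internally.
    return rng.choices(top, weights=weights, k=1)[0]
```

A temperature below 1 sharpens the distribution over the k surviving candidates, so 0.9 makes sampling slightly more conservative than sampling from the unscaled softmax, while k = 50 removes the long tail of unlikely tokens entirely.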