Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement

Authors: Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics
Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong, Shenzhen; 2 Tencent AI Lab; 3 Nanjing University; 4 Shenzhen Research Institution of Big Data
Pseudocode | No | The paper describes the model architecture and training details in prose and mathematical equations without using explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Cypress-Yang/SongBloom
Open Datasets | No | "We use a large-scale song dataset totaling 100K hours, including both Chinese and English songs, to train our model. The original lyrics are of relatively low quality, prompting the use of a dedicated data cleaning pipeline." and "Some training data is sensitive and cannot be made public."
Dataset Splits | No | "We use a large-scale song dataset totaling 100K hours, including both Chinese and English songs, to train our model. The original lyrics are of relatively low quality, prompting the use of a dedicated data cleaning pipeline." and "A subset comprising 20 samples is used for subsequent human evaluation." There is no explicit mention of training, validation, or test splits for the main dataset.
Hardware Specification | Yes | The model is trained with 16 A100 GPUs for approximately one week.
Software Dependencies | No | The paper mentions several models and toolkits used (Demucs, WhisperX, stable-audio-vae, LLaMA-2, FlashAttention2, AdamW optimizer, DeepSpeed strategy) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In our experiments, the codebook size for semantic tokens is set to 16384, at a frame rate of 25. The stable-audio-vae [36] is adopted as the implementation of our waveform autoencoder. Minor modifications are made to its hyperparameters to ensure they have the same frame rate, thereby facilitating synchronized sketches and latents. The core component of SongBloom is based on the LLaMA-2 decoder architecture [37], utilizing causal attention as the backbone of the autoregressive LM. This architecture is further modified to support bidirectional attention, forming our diffusion transformer. Rotary Positional Embeddings (RoPE) [38] are employed in both autoregressive and non-autoregressive transformers to encode positional information. The acoustic encoder is a simple two-layer convolutional network. All conditioning inputs are prepended to the input sequence. Attention modules in each layer use 24 heads with a hidden dimension of 1536, consistent across both autoregressive and non-autoregressive components. The patch size is set to 16, spanning 0.64 seconds. We evaluate two model configurations in our experiments: (1) SongBloom-tiny, comprising 16 layers for the autoregressive LM and 8 layers for the non-autoregressive diffusion transformer, capable of generating songs up to 60 seconds in length; and (2) SongBloom-full, comprising 24 autoregressive layers and 12 non-autoregressive layers, enabling song generation up to 150 seconds. The former is used for analysis and ablation studies, while the latter is included to enable fair comparisons with other baselines. The model is trained with 16 A100 GPUs for approximately one week. All models are trained using the AdamW optimizer [39] with a learning rate of 1e-4. A cosine learning rate scheduler [40] with 2000 warm-up steps is employed to stabilize early training. Each model is trained for approximately 150K steps with a batch size of 128. The DeepSpeed strategy [41] is adopted to support efficient training. For inference, both stages share a classifier-free guidance coefficient of 1.5. Next-token prediction is performed using top-k sampling with k = 200 and a temperature of 0.9. The diffusion process employs the Euler ODE solver with 36 diffusion steps.
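
The figures quoted in the Experiment Setup row can be cross-checked with a small script. The following is a minimal sketch, not code from the SongBloom repository: the names ModelConfig, patch_seconds, and top_k_sample are hypothetical, the per-head dimension assumes the hidden dimension is split evenly across heads, and the sampler is a generic top-k routine that only illustrates the reported settings (k = 200, temperature 0.9).

```python
from dataclasses import dataclass
import math
import random

FRAME_RATE_HZ = 25  # reported frame rate of semantic tokens and latents
PATCH_SIZE = 16     # reported patch size, in frames


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the two reported configurations."""
    name: str
    ar_layers: int    # layers in the autoregressive LM
    nar_layers: int   # layers in the non-autoregressive diffusion transformer
    max_seconds: int  # maximum reported song length
    heads: int = 24
    hidden_dim: int = 1536

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden dimension is divided evenly across heads.
        return self.hidden_dim // self.heads

    @property
    def max_frames(self) -> int:
        # Number of frames needed to cover the maximum song length.
        return self.max_seconds * FRAME_RATE_HZ


TINY = ModelConfig("SongBloom-tiny", ar_layers=16, nar_layers=8, max_seconds=60)
FULL = ModelConfig("SongBloom-full", ar_layers=24, nar_layers=12, max_seconds=150)


def patch_seconds(patch_size: int = PATCH_SIZE, frame_rate: int = FRAME_RATE_HZ) -> float:
    """Duration covered by one patch: frames divided by frames per second."""
    return patch_size / frame_rate


def top_k_sample(logits, k=200, temperature=0.9, rng=None):
    """Generic top-k sampling: keep the k largest logits, apply temperature,
    then draw one index from the resulting softmax distribution."""
    rng = rng or random.Random()
    k = min(k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(f"patch duration: {patch_seconds():.2f} s")
    for cfg in (TINY, FULL):
        print(cfg.name, "max frames:", cfg.max_frames, "head dim:", cfg.head_dim)
```

Under these assumptions, a patch of 16 frames at 25 frames per second covers 16 / 25 = 0.64 seconds, which matches the reported value, and the 150-second limit of SongBloom-full corresponds to 3750 frames.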