Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

Authors: Ziyu Wang, Lejun Min, Gus Xia

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is controllable in a flexible way. We use the POP909 dataset to train our model (Wang et al., 2020a). We focus our experiments on the generation of Lead Sheet and Accompaniment, the two lower levels of languages. Objective evaluation: for whole-song well-structuredness, we design the Inter-Phrase Latent Similarity (ILS) metric to measure music structure based on content similarity. Subjective evaluation: we design a double-blind online survey that consists of two parts: short-term (8 measures) evaluation of music quality, and whole-song (32 measures) evaluation of both music quality and well-structuredness. (An illustrative ILS-style sketch appears after the table.)
Researcher Affiliation | Academia | Ziyu Wang (1,2), Lejun Min (2), Gus Xia (2,1); 1: Computer Science Department, NYU Shanghai; 2: Machine Learning Department, MBZUAI. Emails: ziyu.wang@nyu.edu, {lejun.min, gus.xia}@mbzuai.ac.ae
Pseudocode | Yes | Algorithm 1: Whole-song generation algorithm. (A hedged sketch of the cascaded sampling loop follows the table.)
Open Source Code | Yes | We release the complete source code and model checkpoints at https://github.com/ZZWaang/whole-song-gen. The demo page is available at https://wholesonggen.github.io.
Open Datasets | Yes | We use the POP909 dataset to train our model (Wang et al., 2020a).
Dataset Splits | No | The paper states '90% of the songs are used for training and the rest 10% are used for testing.' It explicitly mentions train and test splits but does not provide details for a separate validation split.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU model, CPU type, memory).
Software Dependencies | No | The paper mentions 'a 2D-UNet with cross-attention' as the backbone neural architecture and 'classifier-free guidance', but it does not specify any software names with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions). (A minimal classifier-free guidance sketch follows the table.)
Experiment Setup | Yes | Table 5: The hyperparameter configuration of diffusion model training; the listed attributes are the same across all four stages. Diffusion Steps (N): 1000; Noise Schedule: linear from 1 to 1e-4; UNet Channels: 64; UNet Channel Multipliers: 1, 2, 4, 4; Batch Size: 16; Attention Levels: 3, 4; Number of Heads: 4; Learning Rate: 5e-5.
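
The Pseudocode row above refers to Algorithm 1 (whole-song generation). As a rough illustration only, the Python sketch below shows the cascaded idea of sampling each hierarchical level conditioned on the levels above it; the class and function names (StageModel, generate_whole_song) are placeholders and do not reflect the released code's interface.

import numpy as np

class StageModel:
    """Stand-in for one trained diffusion stage; a real stage would run reverse diffusion."""
    def __init__(self, shape):
        self.shape = shape

    def sample(self, condition):
        # Placeholder: return random piano-roll-like content instead of a denoised sample.
        return np.random.rand(*self.shape)

def generate_whole_song(stages):
    """Sample levels top-down; each lower level conditions on all higher-level outputs."""
    outputs, condition = [], None
    for stage in stages:
        level = stage.sample(condition)
        outputs.append(level)
        condition = list(outputs)
    return outputs

# Four stages, as in the paper's training setup; the two lowest correspond to
# Lead Sheet and Accompaniment, the levels evaluated in the experiments.
song_levels = generate_whole_song([StageModel((32, 128)) for _ in range(4)])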
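
The Research Type row cites the Inter-Phrase Latent Similarity (ILS) metric, which measures structure via content similarity. The paper's exact formulation is not reproduced here; the sketch below only illustrates one plausible reading, comparing average cosine similarity of phrase-level latents within same-labelled phrases against differently labelled ones. All names are hypothetical.

import numpy as np

def inter_phrase_latent_similarity(phrase_latents, phrase_labels):
    """Average pairwise cosine similarity, split by whether two phrases share a label."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    same, diff = [], []
    for i in range(len(phrase_latents)):
        for j in range(i + 1, len(phrase_latents)):
            sim = cosine(phrase_latents[i], phrase_latents[j])
            (same if phrase_labels[i] == phrase_labels[j] else diff).append(sim)
    return np.mean(same), np.mean(diff)

# Example: a verse-chorus-verse-chorus (A-B-A-B) phrase structure.
latents = [np.random.rand(64) for _ in range(4)]
within_label, across_label = inter_phrase_latent_similarity(latents, ["A", "B", "A", "B"])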
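
The Software Dependencies row notes that the paper uses classifier-free guidance with a 2D-UNet backbone but lists no versioned dependencies. For reference, the standard classifier-free guidance noise estimate is sketched below; the model signature is an assumption, not the authors' implementation.

import torch

def guided_noise_prediction(model, x_t, t, condition, guidance_scale=2.0):
    """Blend conditional and unconditional noise predictions (classifier-free guidance)."""
    eps_cond = model(x_t, t, condition)
    eps_uncond = model(x_t, t, None)  # assumes the model accepts a null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)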
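
For convenience, the Table 5 values quoted in the Experiment Setup row are restated as a plain configuration dictionary; the key names are illustrative and do not correspond to any file in the repository.

DIFFUSION_TRAINING_CONFIG = {
    "diffusion_steps": 1000,                   # N
    "noise_schedule": "linear, 1 to 1e-4",
    "unet_channels": 64,
    "unet_channel_multipliers": (1, 2, 4, 4),
    "attention_levels": (3, 4),
    "num_heads": 4,
    "batch_size": 16,
    "learning_rate": 5e-5,
}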