Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

Authors: Ziyu Wang, Lejun Min, Gus Xia

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is controllable in a flexible way. We use the POP909 dataset to train our model (Wang et al., 2020a). We focus our experiments on the generation of Lead Sheet and Accompaniment, the two lower levels of languages. Objective evaluation: for whole-song well-structuredness, we design the Inter-Phrase Latent Similarity (ILS) metric to measure music structure based on content similarity. Subjective evaluation: we design a double-blind online survey that consists of two parts: short-term (8 measures) evaluation of music quality, and whole-song (32 measures) evaluation of both music quality and well-structuredness. (An illustrative ILS-style sketch appears after the table.)
Researcher Affiliation | Academia | Ziyu Wang (1,2), Lejun Min (2), Gus Xia (2,1); 1: Computer Science Department, NYU Shanghai; 2: Machine Learning Department, MBZUAI. Emails: ziyu.wang@nyu.edu, {lejun.min, gus.xia}@mbzuai.ac.ae
Pseudocode | Yes | Algorithm 1: Whole-song generation algorithm. (A hedged sketch of the cascaded sampling loop follows the table.)
Open Source Code | Yes | We release the complete source code and model checkpoints at https://github.com/ZZWaang/whole-song-gen. The demo page is available at https://wholesonggen.github.io.
Open Datasets | Yes | We use the POP909 dataset to train our model (Wang et al., 2020a).
Dataset Splits | No | The paper states '90% of the songs are used for training and the rest 10% are used for testing.' It explicitly mentions train and test splits but does not provide details for a separate validation split.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU model, CPU type, memory).
Software Dependencies | No | The paper mentions 'a 2D-UNet with cross-attention' as the backbone neural architecture and 'classifier-free guidance', but it does not specify any software names with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions). (A minimal classifier-free guidance sketch follows the table.)
Experiment Setup | Yes | Table 5: The hyperparameter configuration of diffusion model training; the listed attributes are the same across all four stages. Diffusion Steps (N): 1000; Noise Schedule: linear from 1 to 1e-4; UNet Channels: 64; UNet Channel Multipliers: 1, 2, 4, 4; Batch Size: 16; Attention Levels: 3, 4; Number of Heads: 4; Learning Rate: 5e-5.
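
The Pseudocode row above refers to Algorithm 1 (whole-song generation). As a rough illustration only, the Python sketch below shows the cascaded idea of sampling each hierarchical level conditioned on the levels above it; the class and function names (StageModel, generate_whole_song) are placeholders and do not reflect the released code's interface.

import numpy as np

class StageModel:
    """Stand-in for one trained diffusion stage; a real stage would run reverse diffusion."""
    def __init__(self, shape):
        self.shape = shape

    def sample(self, condition):
        # Placeholder: return random piano-roll-like content instead of a denoised sample.
        return np.random.rand(*self.shape)

def generate_whole_song(stages):
    """Sample levels top-down; each lower level conditions on all higher-level outputs."""
    outputs, condition = [], None
    for stage in stages:
        level = stage.sample(condition)
        outputs.append(level)
        condition = list(outputs)
    return outputs

# Four stages, as in the paper's training setup; the two lowest correspond to
# Lead Sheet and Accompaniment, the levels evaluated in the experiments.
song_levels = generate_whole_song([StageModel((32, 128)) for _ in range(4)])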
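
The Research Type row cites the Inter-Phrase Latent Similarity (ILS) metric, which measures structure via content similarity. The paper's exact formulation is not reproduced here; the sketch below only illustrates one plausible reading, comparing average cosine similarity of phrase-level latents within same-labelled phrases against differently labelled ones. All names are hypothetical.

import numpy as np

def inter_phrase_latent_similarity(phrase_latents, phrase_labels):
    """Average pairwise cosine similarity, split by whether two phrases share a label."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    same, diff = [], []
    for i in range(len(phrase_latents)):
        for j in range(i + 1, len(phrase_latents)):
            sim = cosine(phrase_latents[i], phrase_latents[j])
            (same if phrase_labels[i] == phrase_labels[j] else diff).append(sim)
    return np.mean(same), np.mean(diff)

# Example: a verse-chorus-verse-chorus (A-B-A-B) phrase structure.
latents = [np.random.rand(64) for _ in range(4)]
within_label, across_label = inter_phrase_latent_similarity(latents, ["A", "B", "A", "B"])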
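
The Software Dependencies row notes that the paper uses classifier-free guidance with a 2D-UNet backbone but lists no versioned dependencies. For reference, the standard classifier-free guidance noise estimate is sketched below; the model signature is an assumption, not the authors' implementation.

import torch

def guided_noise_prediction(model, x_t, t, condition, guidance_scale=2.0):
    """Blend conditional and unconditional noise predictions (classifier-free guidance)."""
    eps_cond = model(x_t, t, condition)
    eps_uncond = model(x_t, t, None)  # assumes the model accepts a null condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)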
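
For convenience, the Table 5 values quoted in the Experiment Setup row are restated as a plain configuration dictionary; the key names are illustrative and do not correspond to any file in the repository.

DIFFUSION_TRAINING_CONFIG = {
    "diffusion_steps": 1000,                   # N
    "noise_schedule": "linear, 1 to 1e-4",
    "unet_channels": 64,
    "unet_channel_multipliers": (1, 2, 4, 4),
    "attention_levels": (3, 4),
    "num_heads": 4,
    "batch_size": 16,
    "learning_rate": 5e-5,
}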