Structured Multi-Track Accompaniment Arrangement via Style Prior Modelling

Authors: Jingwei Zhao, Gus Xia, Ziyu Wang, Ye Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the performance of our multi-track accompaniment system. Given that existing methods primarily focus on lead sheet to multi-track arrangement, we ensure a fair comparison by using the two-stage approach discussed in Section 4. In Section 5.1, we present the datasets used and the training details of our model. In Section 5.2, we describe the baseline models used for comparison. Our evaluation is divided into two parts: objective evaluation, detailed in Section 5.3, and subjective evaluation, covered in Section 5.4. For the single-stage piano to multi-track (Stage 2) and lead sheet to piano (Stage 1) arrangement tasks, we perform additional comparisons with various ablation architectures in Sections 5.5 and 5.6, respectively.
Researcher Affiliation | Academia | Jingwei Zhao (1,3), Gus Xia (4,5), Ziyu Wang (5,4), Ye Wang (2,1,3). 1: Institute of Data Science, NUS; 2: School of Computing, NUS; 3: Integrative Sciences and Engineering Programme, NUS Graduate School; 4: Machine Learning Department, MBZUAI; 5: Computer Science Department, NYU Shanghai.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | We release our code and more resources at https://github.com/zhaojw1998/Structured-Arrangement-Code.
Open Datasets | Yes | We use two datasets to train the autoencoder and the style prior, respectively. The autoencoder is trained with Slakh2100 [25], which contains 2K curated multi-track pieces with 34 instrument classes in a balanced distribution. [...] We use the Lakh MIDI Dataset (LMD) [28] to train the prior model. It contains 170k music pieces and is a benchmark dataset for training music generative models.
Dataset Splits | Yes | We use the official training split and augment training samples by transposing to all 12 keys. [...] We collect 2/4 and 4/4 pieces (110k after processing) and randomly split LMD at song level into training (95%) and validation (5%) sets. (See the data-preparation sketch after the table.)
Hardware Specification | Yes | The autoencoder comprises 19M learnable parameters and is trained with batch size 64 for 30 epochs on an RTX A5000 GPU with 24GB memory. [...] Our prior model has 30M parameters and is trained with batch size 16 for 10 epochs (600K iterations) on two RTX A5000 GPUs.
Software Dependencies | No | The paper mentions using the "Adam optimizer [19]" and the "AdamW optimizer [22]" but does not specify version numbers for these or other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | The autoencoder comprises 19M learnable parameters and is trained with batch size 64 for 30 epochs on an RTX A5000 GPU with 24GB memory. We use the Adam optimizer [19] with a learning rate exponentially decayed from 1e-3 to 1e-5. We use an exponential moving average (EMA) [29] and random restart [7] to update the codebook with commitment ratio β = 0.25. Our prior model has 30M parameters and is trained with batch size 16 for 10 epochs (600K iterations) on two RTX A5000 GPUs. We apply the AdamW optimizer [22] with a learning rate of 1e-4, scheduled by a 1k-step linear warm-up followed by a single cycle of cosine decay to a final rate of 1e-6. (See the codebook-update and learning-rate-schedule sketches after the table.)
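
The "Dataset Splits" row describes 12-key transposition augmentation and a random 95%/5% song-level split of LMD. Below is a minimal sketch of such preprocessing; the (pitch, onset, duration) note representation and the function names are illustrative assumptions, not the authors' released pipeline.

```python
import random

def transpose_to_all_keys(notes):
    """Augment one piece by transposing it to all 12 keys.

    `notes` is assumed to be a list of (pitch, onset, duration) tuples;
    this representation is illustrative, not the paper's actual format.
    """
    augmented = []
    for shift in range(12):  # 0..11 semitones, i.e. all 12 keys
        augmented.append([(pitch + shift, onset, dur) for pitch, onset, dur in notes])
    return augmented

def split_song_level(song_ids, train_ratio=0.95, seed=0):
    """Randomly split song IDs into training (95%) and validation (5%) at the song level."""
    ids = list(song_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]
```

Splitting by song ID rather than by segment keeps all excerpts of one piece on the same side of the split, which is the usual way to avoid leakage between training and validation.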
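
The "Experiment Setup" row states that the autoencoder's codebook is updated with an exponential moving average (EMA) and random restart, using commitment ratio β = 0.25. Below is a minimal PyTorch sketch of an EMA-updated vector quantizer with these two mechanisms; only β = 0.25 comes from the paper, while the codebook size, EMA decay, and restart threshold are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAVectorQuantizer(nn.Module):
    """Vector quantizer with EMA codebook updates, commitment loss, and random restart."""

    def __init__(self, num_codes=256, dim=64, beta=0.25, decay=0.99, restart_threshold=0.03):
        super().__init__()
        self.beta = beta                            # commitment ratio (0.25, from the paper)
        self.decay = decay                          # EMA decay (assumed value)
        self.restart_threshold = restart_threshold  # usage floor for random restart (assumed)
        self.register_buffer("codebook", torch.randn(num_codes, dim))
        self.register_buffer("cluster_size", torch.ones(num_codes))
        self.register_buffer("embed_avg", self.codebook.clone())

    def forward(self, z_e):
        # z_e: (batch, dim) encoder outputs; flatten any extra axes beforehand.
        dist = torch.cdist(z_e, self.codebook)      # (batch, num_codes) pairwise distances
        codes = dist.argmin(dim=1)                  # nearest-code indices
        z_q = self.codebook[codes]                  # quantized vectors

        if self.training:
            with torch.no_grad():
                one_hot = F.one_hot(codes, self.codebook.size(0)).type_as(z_e)
                # EMA update of per-code usage counts and running sums of assigned vectors.
                self.cluster_size.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(one_hot.t() @ z_e, alpha=1 - self.decay)
                n = self.cluster_size.sum()
                smoothed = (self.cluster_size + 1e-5) / (n + self.codebook.size(0) * 1e-5) * n
                self.codebook.copy_(self.embed_avg / smoothed.unsqueeze(1))
                # Random restart: re-seed rarely used codes with random encoder outputs.
                dead = self.cluster_size < self.restart_threshold
                if dead.any():
                    idx = torch.randint(0, z_e.size(0), (int(dead.sum()),), device=z_e.device)
                    self.codebook[dead] = z_e[idx]
                    self.embed_avg[dead] = z_e[idx]
                    self.cluster_size[dead] = 1.0

        # Commitment loss pulls encoder outputs toward their codes (beta = 0.25).
        commit_loss = self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator so gradients flow back through the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, commit_loss
```

In a typical VQ-VAE setup, this module sits between encoder and decoder, and `commit_loss` is added to the reconstruction loss; the codebook itself is trained only through the EMA statistics, not by gradient descent.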
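
The same row also specifies the optimizer settings: Adam with a learning rate decayed exponentially from 1e-3 to 1e-5 over 30 epochs for the autoencoder, and AdamW at 1e-4 with a 1k-step linear warm-up followed by a single cosine cycle down to 1e-6 over 600K iterations for the prior. The sketch below shows one way to realize these schedules with standard PyTorch schedulers; the placeholder models and the per-epoch vs. per-step stepping convention are assumptions.

```python
import torch
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import ExponentialLR, LinearLR, CosineAnnealingLR, SequentialLR

# --- Autoencoder: Adam, lr decayed exponentially from 1e-3 to 1e-5 over 30 epochs ---
autoencoder = torch.nn.Linear(64, 64)             # placeholder for the real autoencoder
ae_opt = Adam(autoencoder.parameters(), lr=1e-3)
gamma = (1e-5 / 1e-3) ** (1 / 30)                 # per-epoch factor: 1e-3 * gamma**30 == 1e-5
ae_sched = ExponentialLR(ae_opt, gamma=gamma)     # call ae_sched.step() once per epoch

# --- Prior: AdamW, lr 1e-4, 1k-step linear warm-up, then cosine decay to 1e-6 ---
prior = torch.nn.Linear(64, 64)                   # placeholder for the real prior model
total_steps, warmup_steps = 600_000, 1_000
prior_opt = AdamW(prior.parameters(), lr=1e-4)
warmup = LinearLR(prior_opt, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(prior_opt, T_max=total_steps - warmup_steps, eta_min=1e-6)
prior_sched = SequentialLR(prior_opt, schedulers=[warmup, cosine], milestones=[warmup_steps])
# call prior_sched.step() once per optimizer step (600K iterations in total)
```

With these numbers, the autoencoder's exponential factor works out to (1e-5 / 1e-3)^(1/30) ≈ 0.858 per epoch, and the prior's cosine cycle spans the 599K steps remaining after warm-up.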