Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Unifying Symbolic Music Arrangement: Track-Aware Reconstruction and Structured Tokenization

Authors: Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Qihao Liang, Torin Hopkins, Ye Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present a unified framework for automatic multitrack music arrangement that enables a single pre-trained symbolic music model to handle diverse arrangement scenarios, including reinterpretation, simplification, and additive generation. [...] Our method outperforms task-specific state-of-the-art models on representative tasks in different arrangement scenarios (band arrangement, piano reduction, and drum arrangement), in both objective metrics and perceptual evaluations. [...] We evaluate our method on three representative music arrangement tasks that reflect typical arrangement scenarios, each assessing different capabilities of the model: band arrangement (reinterpretation), piano reduction (simplification), and drum arrangement (additive generation).
Researcher Affiliation	Academia	Sound and Music Computing Lab, School of Computing, NUS; Courant Institute of Mathematical Sciences, New York University; Music X Lab, MBZUAI
Pseudocode	No	The paper describes methods and processes in detail within the main text and appendices but does not feature any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Demos and code: https://www.oulongshen.xyz/automatic_arrangement. [...] All datasets used in this work are open-source. Please refer to the code in the supplementary material, which will be released upon acceptance.
Open Datasets	Yes	Pre-training adopted the Los Angeles MIDI dataset [15] (405K MIDI files, 4.3B tokens after REMI-z tokenization, 2% validation split) and fine-tuning was done with Slakh2100 [21] (1,289 training, 270 validation, 151 test MIDI files), featuring 34 pitched instruments and drums, with 4 tracks per piece. [...] All datasets used in this work are open-source.
Dataset Splits	Yes	Pre-training adopted the Los Angeles MIDI dataset [15] (405K MIDI files, 4.3B tokens after REMI-z tokenization, 2% validation split) and fine-tuning was done with Slakh2100 [21] (1,289 training, 270 validation, 151 test MIDI files)
Hardware Specification	Yes	Pre-training used four RTX A5000 GPUs (batch size 12, 1 epoch), while fine-tuning used a single A40 GPU (variable batch size, 3 epochs).
Software Dependencies	No	The pre-training was implemented using pytorch and transformers frameworks on a Linux platform, while fine-tuning additionally utilized lightning. Specific version numbers for these frameworks are not provided.
Experiment Setup	Yes	Our model, an 80M-parameter decoder-only Transformer, has a hidden dimension of 768, 12 layers, 16-head attention, and a context length of 2048 tokens (around 8× the longest bar in our dataset). The model first undergoes a standard next-token-prediction pre-training, and then was fine-tuned with the proposed objective. Pre-training used four RTX A5000 GPUs (batch size 12, 1 epoch), while fine-tuning used a single A40 GPU (variable batch size, 3 epochs). Detailed hyperparameter settings are in Appendix B.5. For fine-tuning, we conducted a simple learning rate search over 1e-5, 5e-5, 1e-4, selecting the optimal value based on validation loss. This resulted in learning rates of 5e-5 for drum arrangement and 1e-4 for band arrangement and piano reduction. The batch sizes and context lengths were configured as follows: band arrangement used a batch size of 24 and context length of 768; piano reduction similarly adopted a batch size of 24 and context length of 768; drum arrangement employed a batch size of 8 and context length of 1536. Across all fine-tuning tasks, we used the AdamW optimizer with 0.01 weight decay, incorporating a linear learning rate scheduler with 500-step warmup. Training spanned 3 epochs for band, piano, and drum tasks, with early stopping patience of 2 epochs. The best checkpoints were selected based on validation loss.
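The fine-tuning settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `FineTuneConfig`, `CONFIGS`, and `linear_warmup_decay` are hypothetical, the total step count is an assumed placeholder, and the decay-to-zero shape after warmup is an assumption (the quoted text says only "linear learning rate scheduler with 500-step warmup").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Per-task fine-tuning settings as quoted in the Experiment Setup row."""
    learning_rate: float
    batch_size: int
    context_length: int
    weight_decay: float = 0.01        # AdamW weight decay (quoted)
    warmup_steps: int = 500           # linear warmup length (quoted)
    max_epochs: int = 3               # quoted
    early_stopping_patience: int = 2  # in epochs (quoted)


# Task keys are hypothetical labels; the values come from the quoted text.
CONFIGS = {
    "band_arrangement": FineTuneConfig(1e-4, batch_size=24, context_length=768),
    "piano_reduction": FineTuneConfig(1e-4, batch_size=24, context_length=768),
    "drum_arrangement": FineTuneConfig(5e-5, batch_size=8, context_length=1536),
}


def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Ramps linearly from 0 to 1 over the warmup, then decays linearly
    to 0 at total_steps (the decay shape is an assumption).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    total_steps = 10_000  # assumed placeholder; the real count is not reported here
    cfg = CONFIGS["drum_arrangement"]
    for step in (0, 250, 500, 5_250, 10_000):
        scale = linear_warmup_decay(step, cfg.warmup_steps, total_steps)
        print(step, cfg.learning_rate * scale)
```

A function of this shape could be passed to a scheduler such as PyTorch's `LambdaLR`; whether the authors did so is not stated in the quoted text.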