N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding

Authors: Jinhao Tian, Zuchao Li, Jiajia Li, Ping Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various datasets demonstrate the effectiveness of our method and achieve state-of-the-art performance on a series of music understanding downstream tasks.
Researcher Affiliation | Academia | (1) National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, P. R. China; (2) Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China; (3) School of Information Management, Wuhan University, Wuhan 430072, China
Pseudocode | Yes | Algorithm 1: Unsupervised Compoundation Construct Algorithm (a hedged sketch follows the table).
Open Source Code | No | The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer.
Open Datasets | Yes | Pianist8 (joann8512 2021), EMOPIA (Hung et al. 2021), GTZAN (Sturm 2013), Nottingham (Allwright 2003), and POP909 (Wang et al. 2020).
Dataset Splits | Yes | Throughout both pre-training and fine-tuning, we designated 90% of each task's dataset for training and the remaining 10% for validation.
Hardware Specification | Yes | Pre-training took 44 hours (about 128k steps) on 4 NVIDIA GeForce RTX 3060 GPUs, using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages.
Experiment Setup | Yes | We set the UCW vocabulary size at 1000... We chose an N-value of 4, excluding N-grams with frequencies below 200. We used the same hyper-parameters as the MIDI-Bert model (Chou et al. 2021), which has a 12-layer structure with 12 self-attention heads and a hidden size of 768 for each self-attention layer. We set a sequence length of 512 for both training stages... using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps... masking 15% of input tokens for prediction... For all tasks, we fine-tuned our pre-trained model for up to 15 epochs, maintaining a consistent sequence length of 512 CP tokens. (A hedged configuration sketch appears below the table.)
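The Pseudocode row refers to Algorithm 1, the paper's unsupervised compoundation (UCW) construction procedure. The paper's exact algorithm is not reproduced in this report, so the sketch below is only a plausible reading of the reported settings (N up to 4, N-grams with frequency below 200 discarded, compound vocabulary capped at 1000); the function names build_ucw_vocab and compound_sequence are hypothetical.

```python
from collections import Counter

# Hedged sketch of an unsupervised N-gram compound (UCW) vocabulary builder.
# Assumed settings mirror the reported setup: N up to 4, minimum frequency 200,
# vocabulary size 1000. The paper's Algorithm 1 may differ in its details.

def build_ucw_vocab(sequences, max_n=4, min_freq=200, vocab_size=1000):
    """Count token N-grams (2..max_n) over a corpus of token sequences and
    keep the most frequent ones as compound-word (UCW) entries."""
    counts = Counter()
    for seq in sequences:
        for n in range(2, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    frequent = [(ng, c) for ng, c in counts.items() if c >= min_freq]
    frequent.sort(key=lambda x: (-x[1], -len(x[0])))  # frequent and long first
    return {ng for ng, _ in frequent[:vocab_size]}

def compound_sequence(seq, ucw_vocab, max_n=4):
    """Greedily replace the longest matching N-gram at each position with a
    single compound token (represented here as a tuple of the merged tokens)."""
    out, i = [], 0
    while i < len(seq):
        for n in range(max_n, 1, -1):          # prefer longer matches
            cand = tuple(seq[i:i + n])
            if len(cand) == n and cand in ucw_vocab:
                out.append(cand)
                i += n
                break
        else:
            out.append(seq[i])                  # no compound starts here
            i += 1
    return out
```

In use, the vocabulary would be built once over the pre-training corpus and every token sequence compounded with it before being fed to the encoder.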
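The Experiment Setup row maps onto a standard masked-language-model pre-training configuration. The sketch below assumes a generic Hugging Face BERT stack rather than the paper's actual NG-Midiformer code (which also injects N-gram features and embeds CP tokens); the layer count, head count, hidden size, sequence length, warmup steps, approximate total steps, and 15% masking ratio come from the quoted setup, while the peak learning rate and the linear-warmup schedule are assumptions.

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup

# Hedged sketch of the reported pre-training setup with a generic BERT-style
# masked language model; only the stated hyper-parameters are mirrored here.

config = BertConfig(
    num_hidden_layers=12,          # 12-layer structure
    num_attention_heads=12,        # 12 self-attention heads
    hidden_size=768,               # hidden size of 768
    max_position_embeddings=512,   # sequence length of 512 CP tokens
)
model = BertForMaskedLM(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # peak lr is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,        # "warmed up over the initial 1k steps"
    num_training_steps=128_000,    # "about 128k steps"
)

MASK_PROB = 0.15                   # "masking 15% of input tokens for prediction"
```

Fine-tuning would then reuse the same 512-token sequence length for up to 15 epochs per downstream task, as reported in the setup.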