N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding
Authors: Jinhao Tian, Zuchao Li, Jiajia Li, Ping Wang
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various datasets demonstrate the effectiveness of our method and achieve state-of-the-art performance on a series of music understanding downstream tasks. |
| Researcher Affiliation | Academia | 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, P. R. China, 2Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China, 3School of Information Management, Wuhan University, Wuhan 430072, China |
| Pseudocode | Yes | Algorithm 1: Unsupervised Compoundation Construct Algorithm (see the sketch after the table) |
| Open Source Code | No | The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer. |
| Open Datasets | Yes | Pianist8 (joann8512 2021), EMOPIA (Hung et al. 2021), GTZAN (Sturm 2013), Nottingham (Allwright 2003), and POP909 (Wang* et al. 2020). |
| Dataset Splits | Yes | Throughout both pre-training and fine-tuning, we designated 90% of each task's dataset for training and the remaining 10% for validation. |
| Hardware Specification | Yes | Pre-training took 44 hours (about 128k steps) on 4 NVIDIA GeForce RTX 3060 GPUs, using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | We set the UCW vocabulary size at 1000... We chose an N-value of 4, excluding N-grams with frequencies below 200. We used the same hyper-parameters as the MIDI-Bert model (Chou et al. 2021), which has a 12-layer structure with 12 self-attention heads, and a hidden layer size of 768 for each self-attention layer. We set a sequence length of 512 for both training stages... using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps... masking 15% of input tokens for prediction... For all tasks, we fine-tuned our pre-trained model for up to 15 epochs, maintaining a consistent sequence length of 512 CP tokens. (These reported values are also collected into a configuration sketch after the table.) |
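
The pseudocode row above refers to Algorithm 1, the Unsupervised Compoundation Construct Algorithm. The paper does not reproduce that algorithm here, but the reported settings (N = 4, a minimum N-gram frequency of 200, a UCW vocabulary of 1000) suggest a frequency-based N-gram vocabulary step roughly like the sketch below. The function name `build_ucw_vocab` and the exact counting scheme are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def build_ucw_vocab(token_sequences, n=4, min_freq=200, vocab_size=1000):
    """Hypothetical sketch: count N-grams of symbolic-music tokens and keep
    the most frequent ones as compound 'words' (the UCW vocabulary).
    Details may differ from the paper's Algorithm 1."""
    counts = Counter()
    for seq in token_sequences:
        # Consider N-grams of length 2 up to N (N = 4 in the paper's setup).
        for length in range(2, n + 1):
            for i in range(len(seq) - length + 1):
                counts[tuple(seq[i:i + length])] += 1
    # Exclude rare N-grams (frequency below 200 in the reported setup),
    # then keep the top vocab_size as the UCW vocabulary.
    frequent = [(ng, c) for ng, c in counts.items() if c >= min_freq]
    frequent.sort(key=lambda x: x[1], reverse=True)
    return [ng for ng, _ in frequent[:vocab_size]]
```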
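
For convenience, the hyper-parameters quoted in the experiment-setup row can be gathered into a single configuration. The dictionary keys below are illustrative placeholders, not names from the released code; only the values are taken from the paper.

```python
# Hypothetical configuration mirroring the reported setup.
config = {
    "ucw_vocab_size": 1000,      # UCW vocabulary size
    "ngram_n": 4,                # N-value for N-gram compounding
    "ngram_min_freq": 200,       # N-grams below this frequency are excluded
    "num_layers": 12,            # MIDI-Bert-style encoder depth
    "num_attention_heads": 12,
    "hidden_size": 768,
    "max_seq_len": 512,          # CP tokens, both pre-training and fine-tuning
    "optimizer": "AdamW",
    "warmup_steps": 1000,
    "mask_prob": 0.15,           # masked-token prediction objective
    "finetune_epochs": 15,
}
```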