N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding
Authors: Jinhao Tian, Zuchao Li, Jiajia Li, Ping Wang
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various datasets demonstrate the effectiveness of our method and achieve state-of-the-art performance on a series of music understanding downstream tasks. |
| Researcher Affiliation | Academia | 1National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, P. R. China, 2Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China, 3School of Information Management, Wuhan University, Wuhan 430072, China |
| Pseudocode | Yes | Algorithm 1: Unsupervised Compoundation Construct Algorithm (see the sketch after the table) |
| Open Source Code | No | The code and model weights will be released at https://github.com/CinqueOrigin/NG-Midiformer. |
| Open Datasets | Yes | Pianist8 (joann8512 2021), EMOPIA (Hung et al. 2021), GTZAN (Sturm 2013), Nottingham (Allwright 2003), and POP909 (Wang* et al. 2020). |
| Dataset Splits | Yes | Throughout both pre-training and fine-tuning, we designated 90% of each task's dataset for training and the remaining 10% for validation. |
| Hardware Specification | Yes | Pre-training took 44 hours (about 128k steps) on 4 NVIDIA GeForce RTX 3060 GPUs, using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | We set the UCW vocabulary size at 1000... We chose an N-value of 4, excluding N-grams with frequencies below 200. We used the same hyper-parameters as the MIDI-Bert model (Chou et al. 2021), which has a 12-layer structure with 12 self-attention heads, and a hidden layer size of 768 for each self-attention layer. We set a sequence length of 512 for both training stages... using the AdamW optimizer and a learning rate that warmed up over the initial 1k steps... masking 15% of input tokens for prediction... For all tasks, we fine-tuned our pre-trained model for up to 15 epochs, maintaining a consistent sequence length of 512 CP tokens. (These reported values are also collected into a configuration sketch after the table.) |
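
The pseudocode row above refers to Algorithm 1, the Unsupervised Compoundation Construct Algorithm. The paper does not reproduce that algorithm here, but the reported settings (N = 4, a minimum N-gram frequency of 200, a UCW vocabulary of 1000) suggest a frequency-based N-gram vocabulary step roughly like the sketch below. The function name `build_ucw_vocab` and the exact counting scheme are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def build_ucw_vocab(token_sequences, n=4, min_freq=200, vocab_size=1000):
    """Hypothetical sketch: count N-grams of symbolic-music tokens and keep
    the most frequent ones as compound 'words' (the UCW vocabulary).
    Details may differ from the paper's Algorithm 1."""
    counts = Counter()
    for seq in token_sequences:
        # Consider N-grams of length 2 up to N (N = 4 in the paper's setup).
        for length in range(2, n + 1):
            for i in range(len(seq) - length + 1):
                counts[tuple(seq[i:i + length])] += 1
    # Exclude rare N-grams (frequency below 200 in the reported setup),
    # then keep the top vocab_size as the UCW vocabulary.
    frequent = [(ng, c) for ng, c in counts.items() if c >= min_freq]
    frequent.sort(key=lambda x: x[1], reverse=True)
    return [ng for ng, _ in frequent[:vocab_size]]
```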
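
For convenience, the hyper-parameters quoted in the experiment-setup row can be gathered into a single configuration. The dictionary keys below are illustrative placeholders, not names from the released code; only the values are taken from the paper.

```python
# Hypothetical configuration mirroring the reported setup.
config = {
    "ucw_vocab_size": 1000,      # UCW vocabulary size
    "ngram_n": 4,                # N-value for N-gram compounding
    "ngram_min_freq": 200,       # N-grams below this frequency are excluded
    "num_layers": 12,            # MIDI-Bert-style encoder depth
    "num_attention_heads": 12,
    "hidden_size": 768,
    "max_seq_len": 512,          # CP tokens, both pre-training and fine-tuning
    "optimizer": "AdamW",
    "warmup_steps": 1000,
    "mask_prob": 0.15,           # masked-token prediction objective
    "finetune_epochs": 15,
}
```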