SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint
Authors: Zhonghao Sheng, Kaitao Song, Xu Tan, Yi Ren, Wei Ye, Shikun Zhang, Tao Qin
AAAI 2021, pp. 13798-13805 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results with objective and subjective evaluations demonstrate that SongMASS significantly improves the quality of lyric and melody generation with the help of pre-training and alignment constraint. |
| Researcher Affiliation | Collaboration | National Engineering Research Center for Software Engineering, Peking University; Nanjing University of Science and Technology; Microsoft Research Asia; Zhejiang University |
| Pseudocode | Yes | Algorithm 1 DP for Melody-Lyric Alignment (an illustrative alignment sketch appears after this table) |
| Open Source Code | No | Melody and lyric samples are available at: https://musicgeneration.github.io/SongMASS/ - This link provides samples, not the source code for the methodology. |
| Open Datasets | Yes | We use 380,000+ lyrics from MetroLyrics as our unpaired lyrics for pre-training, which contains 362,237 songs. The lyrics in each song are split into sentences by the line break. For unpaired melodies, we choose The Lakh MIDI Dataset (Raffel 2016). We extract the melody tracks by Midi-miner and finally get 65,954 melodies as our unpaired data for pre-training. ... We use the LMD dataset (Yu and Canales 2019), which contains aligned melodies and lyrics from 7,998 songs. |
| Dataset Splits | Yes | The dataset is split as training/valid/test set with a ratio of 8:1:1. |
| Hardware Specification | Yes | The model is trained on an NVIDIA Tesla T4 GPU card |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned. |
| Experiment Setup | Yes | We choose Transformer (Vaswani et al. 2017) as our basic model structure, which consists of 6 encoder/decoder layers. The hidden size and filter size of each layer are set as 512 and 2048. The number of attention heads is 8. We use the same masking strategy as in Song et al. (2019). We use Adam optimizer (Kingma and Ba 2015) with a learning rate of 5e-4. The model is trained on an NVIDIA Tesla T4 GPU card, and each mini-batch contains 4096 tokens. The hyper-parameter α is set as 0.5. The dataset is split as training/valid/test set with a ratio of 8:1:1. (A configuration sketch with these hyper-parameters appears after this table.) |
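
The pseudocode row only names Algorithm 1; the excerpt does not reproduce it. As a rough illustration of what a dynamic-programming melody-lyric alignment can look like, the sketch below finds a score-maximizing monotonic path through a lyric-token-by-melody-note attention (or similarity) matrix. This is a generic monotonic-alignment DP written for this page, assuming a NumPy matrix `attn`; it is not the paper's Algorithm 1, whose exact constraints and scoring are not given in this excerpt.

```python
import numpy as np

def monotonic_alignment(attn: np.ndarray) -> list:
    """Find a monotonic path through attn[i, j] (lyric token i vs. melody
    note j) that maximizes the summed score. Moves are restricted so that
    neither the token index nor the note index can decrease."""
    n_tok, n_note = attn.shape
    score = np.full((n_tok, n_note), -np.inf)
    back = np.zeros((n_tok, n_note, 2), dtype=int)
    score[0, 0] = attn[0, 0]

    for i in range(n_tok):
        for j in range(n_note):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((score[i - 1, j], (i - 1, j)))          # several tokens on one note
            if j > 0:
                candidates.append((score[i, j - 1], (i, j - 1)))          # one token spans several notes
            if i > 0 and j > 0:
                candidates.append((score[i - 1, j - 1], (i - 1, j - 1)))  # advance token and note together
            best, prev = max(candidates, key=lambda c: c[0])
            score[i, j] = best + attn[i, j]
            back[i, j] = prev

    # Trace the best path back from the last token/note pair to (0, 0).
    path, i, j = [], n_tok - 1, n_note - 1
    while True:
        path.append((i, j))
        if i == 0 and j == 0:
            break
        i, j = back[i, j]
    return path[::-1]

# Example: align 5 hypothetical lyric tokens to 12 melody notes.
attn = np.random.rand(5, 12)
print(monotonic_alignment(attn))
```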
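
For the experiment-setup row, the sketch below writes the reported hyper-parameters (6 encoder/decoder layers, hidden size 512, filter size 2048, 8 attention heads, Adam with learning rate 5e-4) as a minimal PyTorch configuration. This is an assumption-laden sketch, not the authors' code: the framework, embedding layer, and vocabulary size are placeholders, and the masking strategy, 4096-token batching, and loss weight α = 0.5 are training-loop details not shown here.

```python
import torch
from torch import nn

# Hyper-parameters quoted in the table above.
D_MODEL, FFN_DIM, N_HEADS, N_LAYERS, LR = 512, 2048, 8, 6, 5e-4
VOCAB_SIZE = 10_000  # placeholder; not reported in the excerpt

# Standard encoder-decoder Transformer with the reported dimensions.
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    num_encoder_layers=N_LAYERS,
    num_decoder_layers=N_LAYERS,
    dim_feedforward=FFN_DIM,
    batch_first=True,
)
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)  # token embedding (assumed, not stated in the excerpt)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(embedding.parameters()), lr=LR
)
```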