SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

Authors: Zhonghao Sheng, Kaitao Song, Xu Tan, Yi Ren, Wei Ye, Shikun Zhang, Tao Qin

AAAI 2021, pp. 13798-13805 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results with objective and subjective evaluations demonstrate that SongMASS significantly improves the quality of lyric and melody generation with the help of pre-training and alignment constraint.
Researcher Affiliation | Collaboration | 1 National Engineering Research Center for Software Engineering, Peking University; 2 Nanjing University of Science and Technology; 3 Microsoft Research Asia; 4 Zhejiang University
Pseudocode | Yes | Algorithm 1: DP for Melody-Lyric Alignment (an illustrative alignment sketch follows the table).
Open Source Code | No | Melody and lyric samples are available at https://musicgeneration.github.io/SongMASS/; this link provides samples, not the source code for the methodology.
Open Datasets | Yes | We use 380,000+ lyrics from MetroLyrics as our unpaired lyrics for pre-training, which contains 362,237 songs. The lyrics in each song are split into sentences by the line break. For unpaired melodies, we choose the Lakh MIDI Dataset (Raffel 2016). We extract the melody tracks by Midi-miner, and finally get 65,954 melodies as our unpaired data for pre-training. ... We use the LMD dataset (Yu and Canales 2019), which contains aligned melodies and lyrics from 7,998 songs.
Dataset Splits | Yes | The dataset is split into training/valid/test sets with a ratio of 8:1:1.
Hardware Specification | Yes | The model is trained on an NVIDIA Tesla T4 GPU card.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly mentioned.
Experiment Setup | Yes | We choose the Transformer (Vaswani et al. 2017) as our basic model structure, which consists of 6 encoder/decoder layers. The hidden size and filter size of each layer are set to 512 and 2048, respectively. The number of attention heads is 8. We use the same masking strategy as in Song et al. (2019). We use the Adam optimizer (Kingma and Ba 2015) with a learning rate of 5e-4. The model is trained on an NVIDIA Tesla T4 GPU card, and each mini-batch contains 4096 tokens. The hyper-parameter α is set to 0.5. The dataset is split into training/valid/test sets with a ratio of 8:1:1. (A minimal configuration sketch follows the table.)
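
The pseudocode row above refers to the paper's Algorithm 1 (dynamic programming for melody-lyric alignment). The paper's exact algorithm is not reproduced here; the sketch below only illustrates the general idea of a monotonic DP alignment over a note-to-token similarity matrix (for example, attention weights). The function name, the scoring scheme, and the assumption that every note maps to exactly one lyric token are illustrative assumptions, not the paper's specification.

import numpy as np

def monotonic_dp_align(sim):
    """Assign each note (row) to one lyric token (column), monotonically.

    sim[i, j] is an assumed similarity score (e.g., an attention weight)
    between note i and lyric token j. Requires n_notes >= n_tokens so that
    every token receives at least one note. Illustrative sketch only; this
    is not the paper's Algorithm 1.
    """
    n_notes, n_tokens = sim.shape
    # dp[i, j]: best total score aligning the first i notes to the first j
    # tokens, with note i-1 assigned to token j-1.
    dp = np.full((n_notes + 1, n_tokens + 1), -np.inf)
    dp[0, 0] = 0.0
    advanced = np.zeros((n_notes + 1, n_tokens + 1), dtype=bool)
    for i in range(1, n_notes + 1):
        for j in range(1, min(i, n_tokens) + 1):
            stay = dp[i - 1, j]          # note i-1 shares its token with note i-2
            advance = dp[i - 1, j - 1]   # note i-1 starts a new token
            if advance >= stay:
                dp[i, j] = advance + sim[i - 1, j - 1]
                advanced[i, j] = True
            else:
                dp[i, j] = stay + sim[i - 1, j - 1]
    # Backtrack from the cell that uses all notes and all tokens.
    assign = [0] * n_notes
    j = n_tokens
    for i in range(n_notes, 0, -1):
        assign[i - 1] = j - 1
        if advanced[i, j]:
            j -= 1
    return assign

# Tiny usage example with random "attention" scores: 6 notes, 4 lyric tokens.
sim = np.random.rand(6, 4)
print(monotonic_dp_align(sim))  # e.g., [0, 0, 1, 2, 2, 3]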
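
The experiment-setup row quotes concrete model and optimization hyper-parameters. As a reading aid only, the following minimal PyTorch sketch instantiates a Transformer with those dimensions, the quoted Adam learning rate, and the 8:1:1 split. The MASS-style masking, the songwriting-specific training loop, and how α weights the alignment constraint are specific to the paper and are not reproduced here; all variable names are assumptions.

import torch
import torch.nn as nn

# Hyper-parameters quoted in the experiment-setup row.
HIDDEN_SIZE = 512      # hidden size
FFN_SIZE = 2048        # filter (feed-forward) size
NUM_LAYERS = 6         # encoder/decoder layers
NUM_HEADS = 8          # attention heads
LEARNING_RATE = 5e-4   # Adam learning rate
ALPHA = 0.5            # α in the paper, weighting the alignment constraint (not applied in this sketch)

# A plain PyTorch Transformer with the quoted dimensions; the paper's
# MASS-style pre-training and masking are not reproduced here.
model = nn.Transformer(
    d_model=HIDDEN_SIZE,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
    dim_feedforward=FFN_SIZE,
)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Each mini-batch contains 4096 tokens in the paper (handled by the data loader, not shown).

# 8:1:1 split of the 7,998 paired LMD songs; index ranges are illustrative.
n_songs = 7998
n_train, n_valid = int(0.8 * n_songs), int(0.1 * n_songs)
train_ids = range(0, n_train)
valid_ids = range(n_train, n_train + n_valid)
test_ids = range(n_train + n_valid, n_songs)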