MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Authors: Yizhi LI, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
Researcher Affiliation | Collaboration | Yizhi Li (1,2), Ruibin Yuan (3,4), Ge Zhang (4,5,6), Yinghao Ma (7), Xingran Chen, Hanzhi Yin (3), Chenghao Xiao (8), Chenghua Lin (1,2), Anton Ragni (2), Emmanouil Benetos (7), Norbert Gyenge (2), Roger Dannenberg (3), Ruibo Liu (9), Wenhu Chen (5), Gus Xia (10,11), Yemin Shi (6,12), Wenhao Huang (6), Zili Wang, Yike Guo (4), Jie Fu (4,6); m-a-p.ai. Affiliations: 1 University of Manchester; 2 University of Sheffield; 3 Carnegie Mellon University; 4 Hong Kong University of Science and Technology; 5 University of Waterloo; 6 Beijing Academy of Artificial Intelligence; 7 Queen Mary University of London; 8 Durham University; 9 Dartmouth College; 10 MBZUAI; 11 New York University; 12 linksoul.ai
Pseudocode | Yes | Algorithm 1: Pseudocode description of the pre-training loss calculation in Python style. (A minimal illustrative sketch of such a loss follows this table.)
Open Source Code | No | We anticipate that our method and the forthcoming public release of our codes and models will catalyse further research into the application of SSL in music audio, thereby broadening the scope and depth of human understanding of music.
Open Datasets | Yes | Specifically, we provide a special edition of the base model, MERT-95M-public, that is trained on a totally publicly available music dataset, music4all (Santana et al., 2020), with a data size of 910 hours.
Dataset Splits | Yes | We randomly divided the dataset into a training set, validation set and testing set based on a ratio of 12:8:5, all containing the same 20 singers.
Hardware Specification | Yes | Models are trained with 64 A100-40GB GPUs with fp16.
Software Dependencies | No | The paper mentions 'fairseq' as the framework used but does not provide specific version numbers for it or any other key software libraries or dependencies, such as Python or PyTorch versions.
Experiment Setup | Yes | The effective batch sizes for the base model and large model are set to 1.5 and 5.5 hours, and their learning rates are set to 5e-4 and 1.5e-3, respectively. (A back-of-envelope reading of these figures follows the pseudocode sketch below.)
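
Algorithm 1 in the paper presents the pre-training loss in Python-style pseudocode: masked prediction of discrete acoustic-teacher tokens combined with constant-Q transform (CQT) reconstruction for the musical teacher. The sketch below is a minimal, self-contained illustration of that idea; the function name, tensor shapes, the eight-codebook and 84-bin figures, and the cqt_weight term are assumptions made here for illustration, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def pretraining_loss(token_logits, acoustic_targets,
                         cqt_pred, cqt_target, mask, cqt_weight=1.0):
        # token_logits:     (B, T, K, V) logits over V codewords for each of
        #                   K acoustic-teacher codebooks (K is an assumption)
        # acoustic_targets: (B, T, K) discrete teacher tokens per codebook
        # cqt_pred/cqt_target: (B, T, C) predicted / reference CQT frames
        # mask:             (B, T) bool, True at masked timesteps
        V = token_logits.shape[-1]
        # Acoustic teacher: cross-entropy per codebook, masked frames only.
        acoustic_loss = F.cross_entropy(
            token_logits[mask].reshape(-1, V),
            acoustic_targets[mask].reshape(-1))
        # Musical teacher: reconstruct the CQT spectrogram at masked frames.
        musical_loss = F.mse_loss(cqt_pred[mask], cqt_target[mask])
        # Weighted sum; the relative weight is an illustrative assumption.
        return acoustic_loss + cqt_weight * musical_loss

    # Smoke test with toy shapes (batch 2, 100 frames, 8 codebooks,
    # 1024 codewords, 84 CQT bins -- all assumed for illustration).
    B, T, K, V, C = 2, 100, 8, 1024, 84
    loss = pretraining_loss(torch.randn(B, T, K, V),
                            torch.randint(V, (B, T, K)),
                            torch.randn(B, T, C),
                            torch.randn(B, T, C),
                            torch.rand(B, T) < 0.5)

Restricting both terms to masked frames mirrors the HuBERT-style masked-prediction setup the paper builds on; whether the reconstruction term is masked in exactly this way is an assumption of this sketch.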
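
Since the experiment-setup row quotes effective batch sizes in hours of audio, a quick conversion helps interpret the numbers. Assuming 5-second training excerpts (the excerpt length is an assumption here, not stated in this table) and the 64 GPUs listed in the hardware row:

    CLIP_SECONDS = 5   # assumed excerpt length
    NUM_GPUS = 64      # from the hardware row
    for model, hours in [("base", 1.5), ("large", 5.5)]:
        clips = hours * 3600 / CLIP_SECONDS
        print(f"{model}: ~{clips:.0f} clips per effective batch, "
              f"~{clips / NUM_GPUS:.0f} per GPU")

Under these assumptions, the 1.5-hour batch corresponds to roughly 1080 clips (~17 per GPU) and the 5.5-hour batch to roughly 3960 clips (~62 per GPU).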