LLark: A Multimodal Instruction-Following Language Model for Music

Authors: Joshua P. Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In evaluations on three types of tasks (music understanding, captioning, reasoning), we show that LLARK matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks.
Researcher Affiliation | Collaboration | ¹University of Washington, ²Spotify.
Pseudocode | No | The paper describes procedures in text and uses figures but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | LLARK is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.
Open Datasets | Yes | To construct our instruction-tuning datasets, we use a set of only publicly-available, open source, permissively-licensed music datasets. The datasets used for training are summarized in Table 1. For each dataset, we use both the audio and any accompanying annotations.
Dataset Splits | Yes | We use the default train/test split for FMA. We use the official train-test split for the MusicNet dataset. We use a random subset of 1,000 tracks as the test set. The IDs of the tracks in the train and test sets are provided in the code. (See the split sketch below the table.)
Hardware Specification | Yes | Our model is trained on 4× 80GB NVIDIA A100 GPUs. Training takes approximately 54 hours.
Software Dependencies | No | We provide exact software dependencies for our code, alongside Dockerfiles to reproduce our training and data preprocessing environments. We will publicly release this code on publication of the paper.
Experiment Setup | Yes | The model is trained for 100k steps with a global batch size of 32, a cosine learning rate scheduler with 3,000 warmup steps, and a maximum learning rate of 5e-5. We use the AdamW optimizer (Loshchilov & Hutter, 2018) with betas = (0.9, 0.999), ϵ = 1e-6, and do not apply weight decay. We fine-tune both the projection module and the language model throughout, and freeze the audio encoder. The model is trained in BF16.
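
The experiment-setup row maps onto a fairly standard PyTorch optimization configuration. The sketch below is a minimal illustration of those reported hyperparameters only, not LLARK's actual training code: the attribute names `model.audio_encoder`, `model.projection`, and `model.language_model`, the helper `configure_training`, and the per-GPU batch size of 8 (assuming the global batch of 32 is split evenly across 4 GPUs) are assumptions for illustration.

```python
# Minimal sketch of the reported optimization setup (not LLARK's actual code).
# Hypothetical module names are used only to show what is frozen vs. trained.
import torch
from transformers import get_cosine_schedule_with_warmup

MAX_STEPS = 100_000      # total training steps
WARMUP_STEPS = 3_000     # warmup steps before the cosine decay
GLOBAL_BATCH = 32        # e.g. 8 examples per GPU x 4 A100 GPUs (assumption)


def configure_training(model):
    """Freeze the audio encoder; fine-tune the projection module and the LM."""
    model.audio_encoder.requires_grad_(False)   # hypothetical attribute name

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(
        trainable,
        lr=5e-5,              # maximum learning rate
        betas=(0.9, 0.999),
        eps=1e-6,
        weight_decay=0.0,     # no weight decay
    )
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=WARMUP_STEPS,
        num_training_steps=MAX_STEPS,
    )
    return optimizer, scheduler


# The forward/backward passes would then run under BF16 autocast, e.g.:
#   with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#       loss = model(**batch).loss
```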
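
For the dataset-splits row, the exact train/test track IDs are reportedly shipped with the released code; the snippet below is only a hedged sketch of how a fixed 1,000-track held-out split could be drawn reproducibly. The file names `track_ids.txt` and `test_ids.txt`, the seed value, and the `make_split` helper are assumptions for illustration, not taken from the LLARK repository.

```python
# Sketch of a reproducible held-out split: sample 1,000 test track IDs with a
# fixed seed and persist them so the same split can be reloaded later.
# "track_ids.txt" / "test_ids.txt" are hypothetical file names.
import random
from pathlib import Path


def make_split(all_ids_path="track_ids.txt", test_ids_path="test_ids.txt",
               test_size=1_000, seed=0):
    track_ids = Path(all_ids_path).read_text().split()
    rng = random.Random(seed)                       # fixed seed for determinism
    test_ids = set(rng.sample(track_ids, test_size))
    train_ids = [t for t in track_ids if t not in test_ids]
    Path(test_ids_path).write_text("\n".join(sorted(test_ids)))
    return train_ids, sorted(test_ids)
```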