MusicRL: Aligning Music Generation to Human Preferences
Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose MusicRL, the first music generation system finetuned from human feedback. ... We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U... Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences... |
| Researcher Affiliation | Industry | ¹Google DeepMind, ²Now at Cohere, ³Now at Kyutai. |
| Pseudocode | No | The paper describes the RL fine-tuning procedure and mathematical formulations (e.g., in Section 3.2) but does not include structured pseudocode or algorithm blocks. (A hedged reconstruction of the KL-regularized reward is sketched below the table.) |
| Open Source Code | No | The paper mentions 'Website with samples' in the abstract, but does not explicitly state that the source code for the described methodology is publicly available or provide a link to a code repository. |
| Open Datasets | Yes | We split 10,028 captions from MusicCaps (Agostinelli et al., 2023) into 35,333 single sentences describing music. |
| Dataset Splits | Yes | We split the user preference dataset into a train split of size 285,000 and an evaluation split of size 15,000. |
| Hardware Specification | Yes | For the RL-finetuning, we use ... 128 TPU cores of Cloud TPU v5e. For the training of the user preference reward, we use ... 32 TPU cores of Cloud TPU v4. |
| Software Dependencies | No | The paper mentions using 'Adafactor (Shazeer & Stern, 2018) for the optimizer' but does not specify a version number for Adafactor or other key software components used in the experiments. (A hypothetical optimizer instantiation is sketched below the table.) |
| Experiment Setup | Yes | "The common decoding scheme is temperature sampling with temperature T = 0.99." and "For the RL-finetuning, we use a KL regularization strength of 0.001, a policy learning rate of 0.00001, a value learning rate of 0.0001..." (Illustrative sketches of these settings follow the table.) |
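
As noted in the Pseudocode row, Section 3.2 of the paper gives the RL fine-tuning objective only in prose and equations. The sketch below is a hedged reconstruction, not the authors' code: it computes a per-token KL-regularized reward of the form r(x, y) − α·(log π − log π_ref), using the KL strength α = 0.001 quoted under Experiment Setup. The function name, tensor shapes, and the choice to credit the reward-model score to the final token are assumptions.

```python
import numpy as np

def kl_regularized_reward(sequence_reward, policy_logprobs, ref_logprobs, kl_strength=0.001):
    """Per-token reward for KL-regularized RL fine-tuning (hypothetical sketch).

    sequence_reward: scalar reward r(x, y) from the reward model, credited here
                     to the final token of the generated audio-token sequence.
    policy_logprobs: log pi(a_t | s_t) of the sampled tokens under the current policy.
    ref_logprobs:    log pi_ref(a_t | s_t) under the frozen pretrained model.
    kl_strength:     alpha = 0.001, the KL regularization strength quoted in the paper.
    """
    policy_logprobs = np.asarray(policy_logprobs, dtype=np.float64)
    ref_logprobs = np.asarray(ref_logprobs, dtype=np.float64)

    # Per-token KL penalty keeps the fine-tuned policy close to the pretrained model.
    kl_penalty = kl_strength * (policy_logprobs - ref_logprobs)

    rewards = -kl_penalty
    rewards[-1] += sequence_reward  # terminal reward from the reward model
    return rewards
```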
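
The Software Dependencies row flags that Adafactor is named without a library or version. If the training stack were JAX-based (a plausible guess given the TPU hardware, but an assumption), the two optimizers could be instantiated via optax with the learning rates quoted above; the variable names are hypothetical.

```python
import optax

# Hypothetical optimizer setup; the paper names Adafactor but not a library or version.
policy_optimizer = optax.adafactor(learning_rate=1e-5)  # policy learning rate from the paper
value_optimizer = optax.adafactor(learning_rate=1e-4)   # value learning rate from the paper
```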
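
For the decoding scheme quoted under Experiment Setup (temperature sampling with T = 0.99), a minimal, generic illustration of drawing one token from model logits at that temperature is shown below; it is not the paper's implementation.

```python
import numpy as np

def sample_token(logits, temperature=0.99, rng=None):
    """Sample one token id from logits using temperature sampling (illustrative only)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # softmax with max-subtraction for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```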