Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MusicRL: Aligning Music Generation to Human Preferences
Authors: Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian Mcwilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Leonard Hussenot, Neil Zeghidour, Andrea Agostinelli
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose MusicRL, the first music generation system finetuned from human feedback. ... We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U... Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences... |
| Researcher Affiliation | Industry | ¹Google DeepMind, ²Now at Cohere, ³Now at Kyutai. |
| Pseudocode | No | The paper describes the RL fine-tuning procedure and mathematical formulations (e.g., in Section 3.2) but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions 'Website with samples' in the abstract, but does not explicitly state that the source code for the described methodology is publicly available or provide a link to a code repository. |
| Open Datasets | Yes | We split 10,028 captions from MusicCaps (Agostinelli et al., 2023) into 35,333 single sentences describing music. |
| Dataset Splits | Yes | We split the user preference dataset into a train split of size 285,000 and an evaluation split of size 15,000. |
| Hardware Specification | Yes | For the RL-finetuning, we use ... 128 TPU cores of Cloud TPU v5e. For the training of the user preference reward, we use ... 32 TPU cores of Cloud TPU v4. |
| Software Dependencies | No | The paper mentions using 'Adafactor (Shazeer & Stern, 2018) for the optimizer' but does not specify a version number for Adafactor or other key software components used in the experiments. |
| Experiment Setup | Yes | "The common decoding scheme is temperature sampling with temperature T = 0.99." and "For the RL-finetuning, we use a KL regularization strength of 0.001, a policy learning rate of 0.00001, a value learning rate of 0.0001..." |
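
The Dataset Splits and Experiment Setup rows quote a few concrete numbers: a 285,000 / 15,000 train/evaluation split and temperature sampling with T = 0.99. The sketch below illustrates what those two settings mean in general terms. It is not the paper's code: the function names, the example logits, the random seed, and the use of a uniform random shuffle for the split are all assumptions made for illustration, since the quoted text does not say how the split was drawn.

```python
import math
import random

# Values quoted in the table above.
TEMPERATURE = 0.99
TRAIN_SIZE = 285_000
EVAL_SIZE = 15_000


def softmax_with_temperature(logits, temperature=TEMPERATURE):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng, temperature=TEMPERATURE):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def split_indices(n_examples, n_eval, rng):
    """Shuffle example indices and hold out n_eval of them for evaluation.

    Assumption: a uniform random split; the source does not state the method.
    """
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return indices[n_eval:], indices[:n_eval]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, chosen only for repeatability
    print(softmax_with_temperature([2.0, 1.0, 0.0]))  # hypothetical logits
    print(sample_token([2.0, 1.0, 0.0], rng))
    train, evaluation = split_indices(TRAIN_SIZE + EVAL_SIZE, EVAL_SIZE, rng)
    print(len(train), len(evaluation))
```

With T just below 1, the scaled distribution is only slightly sharper than the unscaled softmax, so sampling stays close to the model's raw output distribution.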