Simple and Controllable Music Generation
Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MUSICGEN. |
| Researcher Affiliation | Collaboration | Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez (equal-contribution and core-team roles are marked in the paper), Meta AI, {jadecopet, felixkreuk, adiyoss}@meta.com ... Yossi Adi is affiliated with both The Hebrew University of Jerusalem & Meta AI. |
| Pseudocode | No | The paper describes methods and patterns but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Music samples, code, and models are available at github.com/facebookresearch/audiocraft. |
| Open Datasets | Yes | We use 20K hours of licensed music to train MUSICGEN. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data collections (www.shutterstock.com/music and www.pond5.com) with respectively 25K and 365K instrument-only music tracks. |
| Dataset Splits | No | The paper mentions training on '20K hours of licensed music' and evaluating on the 'MusicCaps benchmark' and an 'in-domain held out evaluation set of 528 music tracks,' but it does not specify explicit train/validation/test dataset splits (e.g., percentages or counts) for its main training dataset. |
| Hardware Specification | No | The quoted passage reports GPU counts and the use of float16 mixed precision, but not GPU models or other hardware details: We train the 300M, 1.5B and 3.3B parameter models, using respectively 32, 64 and 96 GPUs, with mixed precision. More specifically, we use float16 as bfloat16 was leading to instabilities in our setup. |
| Software Dependencies | No | The paper names individual libraries but does not provide versioned dependencies: We use a memory efficient Flash attention [Dao et al., 2022] from the xFormers package [Lefaudeux et al., 2022] to improve both speed and memory usage with long sequences. (See the attention sketch after the table.) |
| Experiment Setup | Yes | We train on 30-second audio crops sampled at random from the full track. We train the models for 1M steps with the AdamW optimizer [Loshchilov and Hutter, 2017], a batch size of 192 examples, β1 = 0.9, β2 = 0.95, a decoupled weight decay of 0.1 and gradient clipping of 1.0. We further rely on D-Adaptation based automatic step-sizes [Defazio and Mishchenko, 2023] for the 300M model as it improves model convergence but showed no gain for the bigger models. We use a cosine learning rate schedule with a warmup of 4000 steps. Additionally, we use an exponential moving average with a decay of 0.99. Finally, for sampling, we employ top-k sampling [Fan et al., 2018], keeping the top 250 tokens and a temperature of 1.0. (See the training configuration sketch after the table.) |
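
For the xFormers dependency noted in the Software Dependencies row, the following is a minimal sketch, not taken from the audiocraft code, of how the package's memory-efficient attention operator is typically called; the tensor shapes are illustrative assumptions.

```python
# Hedged sketch: calling xFormers' memory-efficient attention on dummy tensors.
# Requires a CUDA device and the xformers package; all sizes below are assumptions.
import torch
import xformers.ops as xops

batch, seq_len, num_heads, head_dim = 2, 1500, 16, 64  # hypothetical shapes
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused kernel that avoids materializing the full (seq_len x seq_len) attention
# matrix, which is the speed/memory benefit the paper cites for long sequences.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # (batch, seq_len, num_heads, head_dim)
```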
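The hyperparameters quoted in the Experiment Setup and Hardware Specification rows can be read as a PyTorch training configuration. The sketch below is an illustrative reconstruction, not the authors' code: the model, learning rate, and loss are placeholders, while the AdamW betas, weight decay, gradient clipping, warmup, float16 mixed precision, and top-k sampling values follow the quoted text.

```python
# Hedged sketch of the reported training/sampling setup. The model, lr, and
# loss are hypothetical stand-ins; only the explicitly quoted values are real.
import math
import torch
from torch import nn

model = nn.Linear(512, 2048).cuda()  # placeholder for the MUSICGEN decoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                  # learning rate is not given in the quote; assumed
    betas=(0.9, 0.95),        # β1 = 0.9, β2 = 0.95
    weight_decay=0.1,         # decoupled weight decay of 0.1
)

total_steps, warmup_steps = 1_000_000, 4_000  # 1M steps, 4000-step warmup

def cosine_with_warmup(step: int) -> float:
    # Linear warmup followed by cosine decay, per the quoted schedule.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
scaler = torch.cuda.amp.GradScaler()  # float16 mixed precision, per the hardware row
# (The reported EMA of weights with decay 0.99 and D-Adaptation step sizes for
# the 300M model are omitted from this sketch.)

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> None:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits, targets)  # placeholder LM loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    # Top-k sampling: keep the top 250 tokens, temperature 1.0, as reported.
    temperature, k = 1.0, 250
    topk_logits, topk_idx = torch.topk(logits / temperature, k, dim=-1)
    probs = torch.softmax(topk_logits, dim=-1)
    return topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
```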