Simple and Controllable Music Generation
Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MUSICGEN. |
| Researcher Affiliation | Collaboration | Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez (equal-contribution and core-team roles are marked in the paper), Meta AI, {jadecopet, felixkreuk, adiyoss}@meta.com ... Yossi Adi is affiliated with both The Hebrew University of Jerusalem & Meta AI. |
| Pseudocode | No | The paper describes methods and patterns but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Music samples, code, and models are available at github.com/facebookresearch/audiocraft. |
| Open Datasets | Yes | We use 20K hours of licensed music to train MUSICGEN. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data collections (www.shutterstock.com/music and www.pond5.com) with respectively 25K and 365K instrument-only music tracks. |
| Dataset Splits | No | The paper mentions training on '20K hours of licensed music' and evaluating on the 'MusicCaps benchmark' and an 'in-domain held out evaluation set of 528 music tracks,' but it does not specify explicit train/validation/test dataset splits (e.g., percentages or counts) for its main training dataset. |
| Hardware Specification | No | The quoted passage reports GPU counts and the use of float16 mixed precision, but not GPU models or other hardware details: We train the 300M, 1.5B and 3.3B parameter models, using respectively 32, 64 and 96 GPUs, with mixed precision. More specifically, we use float16 as bfloat16 was leading to instabilities in our setup. |
| Software Dependencies | No | The paper names individual libraries but does not provide versioned dependencies: We use a memory efficient Flash attention [Dao et al., 2022] from the xFormers package [Lefaudeux et al., 2022] to improve both speed and memory usage with long sequences. (See the attention sketch after the table.) |
| Experiment Setup | Yes | We train on 30-second audio crops sampled at random from the full track. We train the models for 1M steps with the AdamW optimizer [Loshchilov and Hutter, 2017], a batch size of 192 examples, β1 = 0.9, β2 = 0.95, a decoupled weight decay of 0.1 and gradient clipping of 1.0. We further rely on D-Adaptation based automatic step-sizes [Defazio and Mishchenko, 2023] for the 300M model as it improves model convergence but showed no gain for the bigger models. We use a cosine learning rate schedule with a warmup of 4000 steps. Additionally, we use an exponential moving average with a decay of 0.99. Finally, for sampling, we employ top-k sampling [Fan et al., 2018], keeping the top 250 tokens and a temperature of 1.0. (See the training configuration sketch after the table.) |
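
For the xFormers dependency noted in the Software Dependencies row, the following is a minimal sketch, not taken from the audiocraft code, of how the package's memory-efficient attention operator is typically called; the tensor shapes are illustrative assumptions.

```python
# Hedged sketch: calling xFormers' memory-efficient attention on dummy tensors.
# Requires a CUDA device and the xformers package; all sizes below are assumptions.
import torch
import xformers.ops as xops

batch, seq_len, num_heads, head_dim = 2, 1500, 16, 64  # hypothetical shapes
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused kernel that avoids materializing the full (seq_len x seq_len) attention
# matrix, which is the speed/memory benefit the paper cites for long sequences.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # (batch, seq_len, num_heads, head_dim)
```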
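The hyperparameters quoted in the Experiment Setup and Hardware Specification rows can be read as a PyTorch training configuration. The sketch below is an illustrative reconstruction, not the authors' code: the model, learning rate, and loss are placeholders, while the AdamW betas, weight decay, gradient clipping, warmup, float16 mixed precision, and top-k sampling values follow the quoted text.

```python
# Hedged sketch of the reported training/sampling setup. The model, lr, and
# loss are hypothetical stand-ins; only the explicitly quoted values are real.
import math
import torch
from torch import nn

model = nn.Linear(512, 2048).cuda()  # placeholder for the MUSICGEN decoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                  # learning rate is not given in the quote; assumed
    betas=(0.9, 0.95),        # β1 = 0.9, β2 = 0.95
    weight_decay=0.1,         # decoupled weight decay of 0.1
)

total_steps, warmup_steps = 1_000_000, 4_000  # 1M steps, 4000-step warmup

def cosine_with_warmup(step: int) -> float:
    # Linear warmup followed by cosine decay, per the quoted schedule.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
scaler = torch.cuda.amp.GradScaler()  # float16 mixed precision, per the hardware row
# (The reported EMA of weights with decay 0.99 and D-Adaptation step sizes for
# the 300M model are omitted from this sketch.)

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> None:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits, targets)  # placeholder LM loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

def sample_next_token(logits: torch.Tensor) -> torch.Tensor:
    # Top-k sampling: keep the top 250 tokens, temperature 1.0, as reported.
    temperature, k = 1.0, 250
    topk_logits, topk_idx = torch.topk(logits / temperature, k, dim=-1)
    probs = torch.softmax(topk_logits, dim=-1)
    return topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
```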