MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

Authors: K R Prajwal, Bowen Shi, Matthew Le, Apoorv Vyas, Andros Tjandra, Mahi Luthra, Baishan Guo, Huiyu Wang, Triantafyllos Afouras, David Kant, Wei-Ning Hsu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments; Table 1: Comparisons between MusicFlow and previous works in text-to-music generation on the MusicCaps dataset; 4.4. Ablation Study
Researcher Affiliation | Collaboration | 1 VGG, University of Oxford, UK (work done while at Meta); 2 Meta, USA
Pseudocode | No | The paper describes the model architecture and training process but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the source code for MusicFlow is publicly available, nor does it provide a direct link to its own code repository.
Open Datasets | Yes | We evaluate our model on MusicCaps (Agostinelli et al., 2023), which incorporates 5.5K 10s-long audio samples annotated by expert musicians in total.
Dataset Splits | No | The paper mentions training on 20K hours of proprietary music data and evaluating on MusicCaps, but it does not give explicit percentages or counts for training, validation, and test splits for either dataset, nor does it reference predefined standard splits in enough detail for reproduction, beyond the 1K MusicCaps subset used for subjective evaluation.
Hardware Specification | No | The paper describes model training details but does not specify the exact hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper mentions software components like Transformers and the Adam optimizer but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | Specifically, the transformers of the first and second stage include 8 and 24 layers of 12 attention heads with 768/3072 embedding/feed-forward network (FFN) dimension, leading to 84M and 246M parameters. The models are trained with an effective batch size of 480K frames, for 300K/600K updates in two stages respectively. For masking, we adopt the span masking strategy and the masking ratio is randomly chosen between 70–100%. Condition dropping probabilities (i.e., p_H and p_E) are 0.3 for both stages. We use the Adam optimizer (Kingma & Ba, 2014) with learning rate 2e-4, linearly warmed up for 4k steps and decayed over the rest of training.
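
To make the quoted setup concrete, below is a minimal PyTorch sketch of those hyperparameters (layer counts, heads, embedding/FFN dimensions, Adam with a 4k-step linear warm-up and linear decay, the 70–100% span-masking ratio, and the 0.3 condition-drop probability). This is not the authors' code: the plain nn.TransformerEncoder stack, the zeroed "null" condition, and the helper names (build_stage_transformer, sample_mask_ratio, maybe_drop_condition) are illustrative assumptions only.

```python
import random

import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Transformer sizes quoted above: stage 1 uses 8 layers, stage 2 uses 24 layers,
# each with 12 attention heads and 768/3072 embedding/FFN dimensions.
def build_stage_transformer(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

model = build_stage_transformer(num_layers=8)  # first stage; pass 24 for the second stage

# Adam with lr 2e-4, linear warm-up over 4k steps, then linear decay to zero
# over the remaining updates (300K for stage 1, 600K for stage 2).
TOTAL_UPDATES = 300_000
WARMUP_STEPS = 4_000
optimizer = Adam(model.parameters(), lr=2e-4)

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

# Span-masking ratio drawn uniformly from 70-100% per example, and a 0.3
# probability of dropping each conditioning input, as stated in the quote.
P_COND_DROP = 0.3

def sample_mask_ratio() -> float:
    return random.uniform(0.70, 1.00)

def maybe_drop_condition(cond: torch.Tensor) -> torch.Tensor:
    # Zeroing the condition stands in for a null condition (an assumption here).
    return torch.zeros_like(cond) if random.random() < P_COND_DROP else cond
```

In a full training loop, scheduler.step() would be called after each of the 300K (or 600K) optimizer updates, matching the linear warm-up and decay described in the quoted setup.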