An Independence-promoting Loss for Music Generation with Language Models

Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on music generation, and run ablations with respect to our independence-promoting loss configurations.
Researcher Affiliation | Collaboration | Universität Hamburg, IRCAM, Meta AI, The Hebrew University of Jerusalem, Kyutai
Pseudocode | Yes | Algorithm 1 (MMD Optimization):
    Input: training macro-batch X                        % B × L
    Encode: X_e = E_θ(X)                                 % B × T × D
    Quantize: Z = Q(X_e)                                 % B × K × T × N
    Optional: apply delay Z^(t)_{:,k} ← Z^(t−k+1)_{:,k}
    Group time with batch axis: Z_{:,k} ← Z_{:,k,:}      % B·T × K × N
    for codebook index k in {1, ..., K} do
        Sample a permutation π ~ U(S_{B·T})
        Shuffle the batch axis: Ẑ_{i,k} = Z_{π(i),k} for i = 1, ..., B·T
    end for
    Compute independence loss (Eq. 7): L_inde = MMD(P_Z ‖ P_Ẑ)
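A minimal PyTorch sketch of Algorithm 1 is given below for illustration. The tensor shapes follow the pseudocode, but the kernel choice (an RBF kernel here), its bandwidth, and how gradients are propagated through the quantized codes are assumptions not specified in the quoted text.

    import torch

    def rbf_kernel(x, y, sigma=1.0):
        # Gaussian kernel matrix between two sample sets of shape (M, D).
        # The RBF kernel and its bandwidth are illustrative assumptions.
        d2 = torch.cdist(x, y) ** 2
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    def mmd2(x, y, sigma=1.0):
        # Biased estimate of the squared maximum mean discrepancy.
        return (rbf_kernel(x, x, sigma).mean()
                + rbf_kernel(y, y, sigma).mean()
                - 2.0 * rbf_kernel(x, y, sigma).mean())

    def independence_loss(z):
        # z: quantizer codes of shape (B, K, T, N) (K codebooks, N bins).
        z = z.float()
        B, K, T, N = z.shape
        # Group the time axis with the batch axis: (B*T, K, N).
        z = z.permute(0, 2, 1, 3).reshape(B * T, K, N)
        # Shuffle the batch axis independently per codebook, which samples
        # from the product of the per-codebook marginals.
        z_shuffled = torch.stack(
            [z[torch.randperm(B * T, device=z.device), k] for k in range(K)],
            dim=1)
        # MMD between joint samples and product-of-marginals samples.
        return mmd2(z.reshape(B * T, K * N), z_shuffled.reshape(B * T, K * N))

In training, this loss would be weighted and added to the codec objective and, as noted in the experiment setup below, computed in a separate backward pass.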
Open Source Code | Yes | Please visit our companion website (encodec-mmd.github.io) for audio examples, support with code, etc.
Open Datasets | Yes | We use 20K hours of licensed music to train both EnCodec and the language model. The training dataset is composed of an internal dataset of 10K high-quality music tracks, and the ShutterStock (www.shutterstock.com/music) and Pond5 (www.pond5.com) music data collections, respectively consisting of 25K and 365K music tracks.
Dataset Splits | Yes | For ablation studies, we rely on a held-out internal evaluation set featuring 528 music tracks.
Hardware Specification | Yes | Models are trained for 600k steps on 8 V100 GPUs... The model is trained on cross-entropy (L_CE) for 1M steps on 32 V100 GPUs... We make these samples fit on a V100 GPU by using gradient checkpointing during encoding...
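The gradient-checkpointing remark can be illustrated with a short, hypothetical PyTorch snippet; the function and module names are illustrative and not taken from the authors' code.

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    def encode_with_checkpointing(encoder_blocks: nn.ModuleList, x: torch.Tensor):
        # Recompute each block's activations in the backward pass instead of
        # storing them, trading compute for memory so that long segments fit
        # on a single V100 GPU.
        for block in encoder_blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x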
Software Dependencies | No | The paper mentions optimizers like Adam and AdamW, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or other libraries).
Experiment Setup | Yes | Models are trained for 600k steps on 8 V100 GPUs with the Adam optimizer, using β1 = 0.5, β2 = 0.9, a learning rate of 3 × 10⁻⁴, a batch size of 64 and segments of 1 second cropped at random in audio sequences. The model is trained on cross-entropy (L_CE) for 1M steps on 32 V100 GPUs with the AdamW optimizer, using β1 = 0.9, β2 = 0.95, a batch size of 192, and audio sequences of 30 seconds. We use a cosine learning rate schedule with a 4000-step warmup. Exponential moving average with a decay of 0.99 is used to recursively smooth model weights. Top-250 sampling is used with a temperature of 1 during inference (Fan et al., 2018). We use a weight of 10³ for the independence loss L_inde, computed in a separate backward pass.
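For concreteness, a hedged PyTorch sketch of the reported language-model training configuration follows. The model is a placeholder and the learning rate is illustrative, since neither is specified in the quoted text, and the authors' actual implementation may differ.

    import math
    import torch
    from torch import nn

    # Placeholder standing in for the token language model (not the authors' model).
    model = nn.Linear(512, 2048)

    # AdamW with the reported betas; the learning rate here is illustrative,
    # as it is not given in the quoted text.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

    # Cosine learning-rate schedule with a 4000-step linear warmup over 1M steps.
    warmup, total_steps = 4000, 1_000_000
    def lr_lambda(step):
        if step < warmup:
            return (step + 1) / warmup
        progress = (step - warmup) / (total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Exponential moving average of the weights with a decay of 0.99.
    ema = torch.optim.swa_utils.AveragedModel(
        model, avg_fn=lambda avg, new, n: 0.99 * avg + 0.01 * new)

    # Top-250 sampling with a temperature of 1 at inference time.
    logits = model(torch.randn(1, 512))
    probs = torch.softmax(logits / 1.0, dim=-1)
    top_p, top_i = probs.topk(250, dim=-1)
    next_token = top_i.gather(-1, torch.multinomial(top_p, 1))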