An Independence-promoting Loss for Music Generation with Language Models
Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on music generation, and run ablations with respect to our independence-promoting loss configurations. |
| Researcher Affiliation | Collaboration | ¹Universität Hamburg, ²IRCAM, ³Meta AI, ⁴The Hebrew University of Jerusalem, ⁵Kyutai |
| Pseudocode | Yes | Algorithm 1 (MMD Optimization). Input: training macro-batch X (shape B×L). Encode: X_e = E_θ(X) (shape B×T×D). Quantize: Z = Q(X_e) (shape B×K×T×N). Optional: apply delay Z^(t)_{·,k} = Z^(t−k+1)_{·,k}. Group time with batch axes: Z_{·,k} ← Z_{·,k,·} (shape B·T×K×N). For codebook index k ∈ {1, …, K}: sample a permutation π ∼ U(S_{BT}) and shuffle the batch axis, {Z̃_{i,k}}_{i=1}^{BT} = {Z_{π(i),k}}_{i=1}^{BT}. Compute the independence loss (Eq. 7): L_inde = MMD(P_Z ‖ P_Z̃). (A PyTorch-style sketch of this procedure is given below the table.) |
| Open Source Code | Yes | Please visit our companion website (encodec-mmd.github.io) for audio examples, support with code, etc. |
| Open Datasets | Yes | We use 20K hours of licensed music to train both EnCodec and the language model. The training dataset is composed of an internal dataset of 10K high-quality music tracks, and the Shutterstock and Pond5 music data collections (www.shutterstock.com/music, www.pond5.com), respectively consisting of 25K and 365K music tracks. |
| Dataset Splits | Yes | For ablation studies, we rely on a held-out internal evaluation set featuring 528 music tracks. |
| Hardware Specification | Yes | Models are trained for 600k steps on 8 V100 GPUs... The model is trained on cross-entropy (L_CE) for 1M steps on 32 V100 GPUs... We make these samples fit on a V100 GPU by using gradient checkpointing during encoding... |
| Software Dependencies | No | The paper mentions optimizers like Adam and AdamW, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or other libraries). |
| Experiment Setup | Yes | Models are trained for 600k steps on 8 V100 GPUs with the Adam optimizer, using β1 = 0.5, β2 = 0.9, a learning rate of 3·10⁻⁴, a batch size of 64, and segments of 1 second cropped at random from the audio sequences. The model is trained on cross-entropy (L_CE) for 1M steps on 32 V100 GPUs with the AdamW optimizer, using β1 = 0.9, β2 = 0.95, a batch size of 192, and audio sequences of 30 seconds. We use a cosine learning rate schedule with a 4000-step warmup. Exponential moving average with a decay of 0.99 is used to recursively smooth the model weights. Top-250 sampling is used with a temperature of 1 during inference (Fan et al., 2018). We use a weight of 10³ for the independence loss L_inde, computed in a separate backward pass. (A configuration sketch of the language-model optimizer and schedule is given below the table.) |
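
The following PyTorch sketch illustrates the shuffling-and-MMD procedure summarized in the Pseudocode row (Algorithm 1). It is a minimal sketch assuming the codes are already one-hot and grouped over the batch and time axes; the kernel choice (an RBF kernel), its bandwidth, and the dummy shapes in the usage example are illustrative assumptions, not necessarily the authors' exact estimator.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    # x: (n, d), y: (m, d) -> (n, m) Gaussian kernel matrix.
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between sample sets x and y.
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

def independence_loss(z, sigma=1.0):
    # z: codes grouped over batch and time, shape (B*T, K, N),
    # e.g. one-hot representations of the K codebooks.
    bt, K, N = z.shape
    # Joint samples P_Z: concatenate the K codebook codes per frame.
    joint = z.reshape(bt, K * N)
    # Product-of-marginals samples P_Z~: shuffle each codebook
    # independently along the (batch*time) axis, as in Algorithm 1.
    shuffled = torch.stack(
        [z[torch.randperm(bt), k] for k in range(K)], dim=1
    ).reshape(bt, K * N)
    return mmd2(joint, shuffled, sigma)

# Usage example with dummy one-hot codes (hypothetical shapes).
z = torch.nn.functional.one_hot(
    torch.randint(0, 16, (256, 4)), num_classes=16
).float()  # (B*T=256, K=4, N=16)
loss = independence_loss(z)
```

Shuffling each codebook with its own permutation breaks dependencies across codebooks while preserving each codebook's marginal, so the MMD between the original and shuffled samples acts as a measure of statistical dependence between codebooks.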
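
For the language-model training settings in the Experiment Setup row, a minimal PyTorch sketch of AdamW with a 4000-step linear warmup into a cosine decay is shown below. The placeholder model, the base learning rate, and the exact warmup/decay composition are assumptions for illustration; the section does not state the language model's learning rate.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)  # placeholder for the actual language model
base_lr = 1e-4                 # hypothetical; not stated in the section
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, betas=(0.9, 0.95))

total_steps, warmup_steps = 1_000_000, 4_000

def lr_lambda(step):
    # Linear warmup over the first 4000 steps, then cosine decay to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Per training step: loss.backward(); optimizer.step(); scheduler.step()
```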