Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
An Independence-promoting Loss for Music Generation with Language Models
Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on music generation, and run ablations with respect to our independence-promoting loss configurations. |
| Researcher Affiliation | Collaboration | ¹Universität Hamburg, ²IRCAM, ³Meta AI, ⁴The Hebrew University of Jerusalem, ⁵Kyutai |
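| Pseudocode | Yes | Algorithm 1 (MMD Optimization). Input: training macro-batch $X$ (shape B, L). Encode $X_e = E_\theta(X)$ (shape B, T, D). Quantize $Z = Q(X_e)$ (shape B, K, T, N). Optional: apply delay $Z^{(t)}_{\cdot,k} = Z^{(t-k+1)}_{\cdot,k}$. Group time with batch axes: $Z_{\cdot,k} \leftarrow Z_{\cdot,k,\cdot}$ (shape B·T, K, N). For each codebook index $k \in \{1, \dots, K\}$: sample a permutation $\pi \sim \mathcal{U}(S_{BT})$ and shuffle the batch axis, $\{\bar{Z}_{i,k}\}_{i=1}^{BT} = \{Z_{\pi(i),k}\}_{i=1}^{BT}$. End for. Compute the independence loss (7): $\mathcal{L}_{\mathrm{inde}} = \mathrm{MMD}(P_Z \parallel P_{\bar{Z}})$. |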
| Open Source Code | Yes | Please visit our companion website (encodec-mmd.github.io) for audio examples, support with code, etc. |
| Open Datasets | Yes | We use 20K hours of licensed music to train both EnCodec and the language model. The training dataset is composed of an internal dataset of 10K high-quality music tracks, and the ShutterStock and Pond5 music data collections (www.shutterstock.com/music, www.pond5.com), respectively consisting of 25K and 365K music tracks. |
| Dataset Splits | Yes | For ablation studies, we rely on a held-out internal evaluation set featuring 528 music tracks. |
| Hardware Specification | Yes | Models are trained for 600k steps on 8 V100 GPUs... The model is trained on cross-entropy (LCE) for 1M steps on 32 V100 GPUs... We make these samples fit on a V100 GPU by using gradient checkpointing during encoding... |
| Software Dependencies | No | The paper mentions optimizers like Adam and AdamW, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or other libraries). |
| Experiment Setup | Yes | Models are trained for 600k steps on 8 V100 GPUs with the Adam optimizer, using $\beta_1 = 0.5$, $\beta_2 = 0.9$, a learning rate of $3 \times 10^{-4}$, a batch size of 64 and segments of 1 second cropped at random in audio sequences. The model is trained on cross-entropy ($\mathcal{L}_{\mathrm{CE}}$) for 1M steps on 32 V100 GPUs with the AdamW optimizer, using $\beta_1 = 0.9$, $\beta_2 = 0.95$, a batch size of 192, and audio sequences of 30 seconds. We use a cosine learning rate schedule with a 4000-step warmup. Exponential moving average with a decay of 0.99 is used to recursively smooth model weights. Top-250 sampling is used with a temperature of 1 during inference (Fan et al., 2018). We use a weight of $10^3$ for the independence loss $\mathcal{L}_{\mathrm{inde}}$, computed in a separate backward pass. |
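
The Pseudocode row above quotes Algorithm 1, which shuffles each codebook independently along the flattened batch-time axis and then compares the original and shuffled codes with an MMD. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The Gaussian kernel, its bandwidth, the biased estimator, the one-hot toy codes, and all function names are assumptions made for illustration; this page does not state which kernel or code relaxation the paper uses.

```python
import math
import random


def shuffle_codebooks(z, rng):
    """Permute each codebook independently along the flattened batch*time axis.

    z: list of samples, each a list of K codebook vectors (lists of floats).
    """
    n, num_codebooks = len(z), len(z[0])
    shuffled = [[None] * num_codebooks for _ in range(n)]
    for k in range(num_codebooks):
        perm = list(range(n))
        rng.shuffle(perm)  # one permutation drawn separately for each codebook
        for i in range(n):
            shuffled[i][k] = z[perm[i]][k]
    return shuffled


def flatten(sample):
    """Concatenate the K codebook vectors of one sample."""
    return [v for codebook in sample for v in codebook]


def gaussian_kernel(x, y, bandwidth):
    # Assumed kernel choice; the source page does not specify one.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""

    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def independence_loss(z, rng, bandwidth=1.0):
    """Compare joint samples with samples whose codebooks were shuffled apart."""
    joint = [flatten(s) for s in z]
    product = [flatten(s) for s in shuffle_codebooks(z, rng)]
    return mmd2(joint, product, bandwidth)


def one_hot(index, size=4):
    return [1.0 if j == index else 0.0 for j in range(size)]


# Hypothetical toy data with K = 2 codebooks and N = 4 codes per codebook.
dependent = [[one_hot(i % 4), one_hot(i % 4)] for i in range(64)]
independent = [[one_hot(i % 4), one_hot((i // 4) % 4)] for i in range(64)]

rng = random.Random(0)
loss_dependent = independence_loss(dependent, rng)
loss_independent = independence_loss(independent, rng)
print(loss_dependent, loss_independent)
```

In this toy setting the loss should be clearly larger when the second codebook copies the first than when the two codebooks vary independently, which is the behavior an independence-promoting penalty relies on.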
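
The Experiment Setup row mentions a cosine learning rate schedule with a 4000-step warmup over 1M steps, an exponential moving average of weights with decay 0.99, and top-250 sampling at temperature 1. The sketch below shows one common way to realize each of these in plain Python. The linear warmup shape, the decay to zero at the final step, and the function names are assumptions; the schedule is expressed as a multiplier because this page does not quote a peak learning rate for the language model.

```python
import math
import random


def lr_scale(step, warmup_steps=4000, total_steps=1_000_000):
    """Learning-rate multiplier: linear warmup, then cosine decay to zero (assumed)."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def ema_update(ema_weights, weights, decay=0.99):
    """One recursive smoothing step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def top_k_sample(logits, k=250, temperature=1.0, rng=random):
    """Sample an index from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical usage with a tiny vocabulary of 5 tokens.
example_logits = [0.1, 2.0, -1.0, 1.5, 0.3]
token = top_k_sample(example_logits, k=2, rng=random.Random(0))
print(lr_scale(2000), ema_update([0.0], [1.0]), token)
```

With `k=2`, the sampled token can only be one of the two highest-scoring indices, which is the truncation effect that top-k sampling is meant to provide.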