Masked Audio Generation using a Single Non-Autoregressive Transformer
Authors: Alon Ziv, Itai Gat, Gaël Le Lan, Tal Remez, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Yossi Adi
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. |
| Researcher Affiliation | Collaboration | 1FAIR Team, Meta 2Kyutai 3The Hebrew University of Jerusalem |
| Pseudocode | Yes | A pseudo-code of our entire decoding algorithm is described in Fig. 4, Appendix D. (A generic sketch of this style of iterative masked decoding is given after the table.) |
| Open Source Code | No | The paper provides links to external, third-party libraries and models used (e.g., Audiocraft, Fréchet Audio Distance, CLAP model, Audio-diffusion-pytorch) or a demo page for generated samples, but it does not provide an explicit statement or link for the open-sourcing of the MAGNeT model's own code. |
| Open Datasets | Yes | We evaluate the proposed method on the MusicCaps benchmark (Agostinelli et al., 2023). |
| Dataset Splits | Yes | For the main results and comparison with prior work, we evaluate the proposed method on the MusicCaps benchmark (Agostinelli et al., 2023). MusicCaps is composed of 5.5K samples (ten-second long) prepared by expert musicians and a 1K subset balanced across genres. We report objective metrics on the unbalanced set, while we sample examples from the genre-balanced set for qualitative evaluations. We additionally evaluate the proposed method using the same in-domain test set as proposed by Copet et al. (2023). All ablation studies were conducted on the in-domain test set. |
| Hardware Specification | Yes | We train the models using respectively 32 GPUs for small and 64 GPUs for large models, with float16 precision. |
| Software Dependencies | No | The paper mentions several software components like 'Tensorflow', 'VGGish model', 'CLAP model', 'xFormers package', and 'Flash attention', but it does not specify concrete version numbers for any of them, which is required for reproducibility. |
| Experiment Setup | Yes | We train the models for 1M steps with the AdamW optimizer (Loshchilov & Hutter, 2017), a batch size of 192 examples, β1 = 0.9, β2 = 0.95, a decoupled weight decay of 0.1 and gradient clipping of 1.0. ... We use a cosine learning rate schedule with a warmup of 4K steps. Additionally, we use an exponential moving average with a decay of 0.99. ... Finally, for inference, we employ nucleus sampling (Holtzman et al., 2020) with top-p 0.9, and a temperature of 3.0 that is linearly annealed to zero during decoding iterations. We use CFG with a condition dropout of 0.3 at training, and a guidance coefficient of 10.0 annealed to 1.0. (A sketch of these inference-time schedules follows the decoding sketch below.) |
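
For readers without access to Fig. 4 in Appendix D, the following is a minimal sketch of confidence-based iterative masked decoding in the style MAGNeT builds on, assuming a generic `model(tokens)` interface, a hypothetical `mask_id` token, and a cosine masking schedule. It omits span masking, per-codebook decoding, text conditioning, nucleus filtering, and classifier-free guidance, and is not the authors' implementation.

```python
# Minimal sketch (assumed interfaces, not the released MAGNeT code):
# start fully masked, predict all positions each step, keep the most
# confident predictions, and re-mask the rest per a cosine schedule.
import math
import torch


@torch.no_grad()
def iterative_masked_decode(model, seq_len, num_steps=20,
                            temperature=3.0, mask_id=2048):
    # Start from a fully masked token sequence of shape (1, seq_len).
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)

    for step in range(num_steps):
        # Temperature is linearly annealed toward zero over the iterations,
        # matching the schedule quoted in the experiment-setup row.
        temp = max(temperature * (1.0 - step / num_steps), 1e-5)

        logits = model(tokens)                       # (1, seq_len, vocab), assumed API
        probs = torch.softmax(logits / temp, dim=-1)
        sampled = torch.multinomial(probs.squeeze(0), 1).squeeze(-1).unsqueeze(0)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Positions already committed in earlier steps are never re-masked.
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, float("inf"))

        # Cosine schedule: how many tokens should remain masked after this step.
        num_to_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))

        # Commit the sampled tokens, then re-mask the least confident ones.
        tokens = torch.where(still_masked, sampled, tokens)
        if num_to_mask > 0:
            remask_idx = conf.topk(num_to_mask, largest=False).indices
            tokens[0, remask_idx] = mask_id
    return tokens
```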
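
The inference-time schedules quoted in the experiment-setup row (top-p 0.9 nucleus sampling, a temperature of 3.0 linearly annealed to zero, and a CFG coefficient annealed from 10.0 to 1.0) can also be written down compactly. The sketch below assumes hypothetical `cond_logits`/`uncond_logits` tensors coming from conditional and unconditional forward passes; it illustrates the quoted settings and is not the released code.

```python
# Sketch of the quoted inference schedules under assumed tensor inputs.
import torch


def annealed(start, end, step, num_steps):
    """Linear interpolation from `start` to `end` across decoding iterations."""
    t = step / max(num_steps - 1, 1)
    return start + (end - start) * t


def cfg_logits(cond_logits, uncond_logits, step, num_steps):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one; the coefficient is annealed 10.0 -> 1.0.
    gamma = annealed(10.0, 1.0, step, num_steps)
    return uncond_logits + gamma * (cond_logits - uncond_logits)


def nucleus_sample(logits, step, num_steps, top_p=0.9):
    # Temperature annealed 3.0 -> 0.0; clamped to avoid division by zero.
    temp = max(annealed(3.0, 0.0, step, num_steps), 1e-5)
    probs = torch.softmax(logits / temp, dim=-1)

    # Top-p (nucleus) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, renormalize, then sample.
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    keep = (cum - sorted_probs) < top_p          # the top token is always kept
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)

    vocab = sorted_probs.shape[-1]
    choice = torch.multinomial(sorted_probs.reshape(-1, vocab), 1)
    picked = sorted_idx.reshape(-1, vocab).gather(-1, choice)
    return picked.reshape(logits.shape[:-1])
```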