From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach considering both objective metrics and human studies. As we demonstrate empirically, such an approach can be applied to a wide variety of tasks and audio domains to replace the traditional GAN-based decoders. |
| Researcher Affiliation | Collaboration | Robin San Roman (FAIR Team, Meta; Université de Lorraine, CNRS, Inria, LORIA, Nancy, France), Yossi Adi (FAIR Team, Meta; The Hebrew University of Jerusalem), Antoine Deleforge and Romain Serizel (Université de Lorraine, CNRS, Inria, LORIA, Nancy, France), Gabriel Synnaeve and Alexandre Défossez (FAIR Team, Meta) |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | Yes | Training and evaluation code are available on the facebookresearch/audiocraft GitHub project (see the usage sketch after this table). |
| Open Datasets | Yes | We use speech from the train set of Common Voice 7.0 (9096 hours) [Ardila et al., 2019] together with the DNS challenge 4 (2425 hours) [Dubey et al., 2022]. For music, we use the MTG-Jamendo dataset (919 hours) [Bogdanov et al., 2019]. For the environmental sound we use FSD50K (108 hours) [Fonseca et al., 2021] and Audio Set (4989 hours) [Gemmeke et al., 2017]. |
| Dataset Splits | No | The paper does not explicitly state the training/validation/test splits (as percentages, absolute counts, or references to predefined splits). It mentions using 'the train set of Common Voice' and sampling from a 'test set', but gives no details on how the data was partitioned for validation. |
| Hardware Specification | Yes | It takes around 2 days on 4 Nvidia V100 GPUs with 16 GB of memory to train one of the 4 models. |
| Software Dependencies | No | The paper mentions software like the 'ViSQOL [Chinen et al., 2020] metric' and 'julius' with a GitHub link, but it does not specify version numbers for these or other key software components (e.g., programming language, deep learning frameworks), which are required for a reproducible dependency description. (See the julius sketch after this table.) |
| Experiment Setup | Yes | We trained our diffusion models using our proposed power schedule with power p = 7.5, β0 = 1.0e-5 and βT = 2.9e-2. ... We train our models using Adam optimizer with batch size 128 and a learning rate of 1e-4. (See the schedule sketch after this table.) |
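Since the open-source row points at facebookresearch/audiocraft, a minimal decoding sketch follows. The class and method names (`MultiBandDiffusion.get_mbd_24khz`, `regenerate`) follow the project's Multi-Band Diffusion documentation at the time of writing; treat them as assumptions and check the repository for the current API. The input file name is hypothetical.

```python
import torchaudio
from audiocraft.models import MultiBandDiffusion

# Load a pretrained Multi-Band Diffusion decoder for 24 kHz EnCodec
# tokens; bw selects the EnCodec bandwidth (1.5, 3.0, or 6.0 kbps).
mbd = MultiBandDiffusion.get_mbd_24khz(bw=3.0)

wav, sr = torchaudio.load("input.wav")  # hypothetical input file
# Round-trip the waveform through EnCodec tokens and the diffusion decoder.
regenerated = mbd.regenerate(wav, sample_rate=sr)
torchaudio.save("regenerated.wav", regenerated.squeeze(0).cpu(), 24_000)
```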
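The dependencies row mentions julius, which provides the band-splitting filters used by a multi-band approach like this one (one diffusion model per frequency band). Below is a hedged sketch using `julius.split_bands`; the choice of 4 bands matches the 4 models mentioned in the hardware row, but the mel-spaced cutoffs implied by `n_bands` are illustrative, not necessarily the paper's exact cutoffs.

```python
import torch
import julius

sr = 24_000
wav = torch.randn(1, sr)  # one second of dummy mono audio at 24 kHz

# Split into 4 frequency bands; julius stacks the bands on a new
# leading dimension, and the bands sum back to (approximately) the input.
bands = julius.split_bands(wav, sample_rate=sr, n_bands=4)
print(bands.shape)               # (4, 1, 24000)
reconstructed = bands.sum(dim=0) # ~= wav
```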
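The setup row quotes the proposed power noise schedule (p = 7.5, β0 = 1.0e-5, βT = 2.9e-2). A minimal sketch is below, assuming the schedule interpolates linearly between β0 and βT in p-th-root space and raises the result back to the power p; the number of diffusion steps T is illustrative, and the paper's own definition should be taken as authoritative.

```python
import numpy as np

def power_beta_schedule(T: int = 1000, p: float = 7.5,
                        beta0: float = 1.0e-5,
                        betaT: float = 2.9e-2) -> np.ndarray:
    """Assumed form: interpolate between beta0 and betaT in p-th-root space."""
    t = np.linspace(0.0, 1.0, T)
    root = beta0 ** (1 / p) + t * (betaT ** (1 / p) - beta0 ** (1 / p))
    return root ** p

betas = power_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # standard DDPM cumulative product
# Training used Adam with batch size 128 and learning rate 1e-4 (per the paper).
```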