High-Fidelity Audio Compression with Improved RVQGAN
Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice |
| Researcher Affiliation | Industry | Rithesh Kumar* Descript, Inc. Prem Seetharaman* Descript, Inc. Alejandro Luebs Descript, Inc. Ishaan Kumar Descript, Inc. Kundan Kumar Descript, Inc. |
| Pseudocode | No | Appendix A provides mathematical equations for a modified codebook learning algorithm, but not structured pseudocode or an algorithm block (see the illustrative sketch after this table). |
| Open Source Code | Yes | We provide code [1], models, and audio samples [2] that we encourage the reader to listen to. [1] https://github.com/descriptinc/descript-audio-codec |
| Open Datasets | Yes | We train our model on a large dataset compiled of speech, music, and environmental sounds. For speech, we use the DAPS dataset [26], the clean speech segments from DNS Challenge 4 [10], the Common Voice dataset [2], and the VCTK dataset [40]. For music, we use the MUSDB dataset [31], and the Jamendo dataset [4]. Finally, for environmental sound, we use both the balanced and unbalanced train segments from Audio Set [14]. |
| Dataset Splits | No | The paper does not explicitly describe a validation dataset split for hyperparameter tuning or model selection. It mentions training data and test data. |
| Hardware Specification | No | The paper mentions training on 'a single GPU' but does not provide specific details such as the model, memory, or manufacturer of the GPU. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer [23]' and Python 3.8, but it does not give version numbers for any key software libraries, frameworks, or solvers beyond the Python version itself. |
| Experiment Setup | Yes | For our ablation study, we train each model with a batch size of 12 for 250k iterations. For our final model, we train with a batch size of 72 for 400k iterations. We train with excerpts of duration 0.38s. We use the AdamW optimizer [23] with a learning rate of 1e-4, β1 = 0.8, and β2 = 0.9, for both the generator and the discriminator. We decay the learning rate at every step, with γ = 0.999996. |
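
The codebook-learning changes referenced in the Pseudocode row are given in Appendix A of the paper only as equations. Below is a minimal, hypothetical PyTorch reading of that scheme: a factorized lookup in a low-dimensional space with L2-normalized codes and a straight-through estimator. The class name, dimensions, and loss formulation are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedVQ(nn.Module):
    """Sketch of an L2-normalized, factorized codebook lookup.

    Encoder features (dim `input_dim`) are projected into a small lookup
    space (`codebook_dim`), both codes and codebook entries are
    L2-normalized before the nearest-neighbour search, and the selected
    code is projected back up. Gradients pass through a straight-through
    estimator. All dimensions here are illustrative assumptions.
    """

    def __init__(self, input_dim: int = 512, codebook_size: int = 1024,
                 codebook_dim: int = 8):
        super().__init__()
        self.in_proj = nn.Linear(input_dim, codebook_dim)
        self.out_proj = nn.Linear(codebook_dim, input_dim)
        self.codebook = nn.Embedding(codebook_size, codebook_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, input_dim) encoder features.
        z = F.normalize(self.in_proj(z_e), dim=-1)        # factorize + L2-normalize
        cb = F.normalize(self.codebook.weight, dim=-1)    # L2-normalize codebook
        # Nearest neighbour under cosine similarity.
        indices = (z @ cb.t()).argmax(dim=-1)
        z_q = cb[indices]
        # Commitment / codebook losses (weighting is a placeholder).
        commit_loss = F.mse_loss(z, z_q.detach())
        codebook_loss = F.mse_loss(z_q, z.detach())
        # Straight-through estimator: copy gradients from z_q to z.
        z_q = z + (z_q - z).detach()
        return self.out_proj(z_q), indices, commit_loss, codebook_loss
```

A call such as `FactorizedVQ()(torch.randn(2, 100, 512))` returns the quantized features, the code indices, and the two auxiliary losses.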
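
The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch optimizer and scheduler objects. The sketch below shows one way they could be wired up; the module names are stand-ins and the released training loop may differ.

```python
import torch

# Values reported in the paper's experiment setup
# (final model: batch size 72, 400k iterations, 0.38 s excerpts).
LEARNING_RATE = 1e-4
BETAS = (0.8, 0.9)
LR_DECAY_GAMMA = 0.999996   # learning rate decayed at every step
NUM_ITERATIONS = 400_000

# Placeholders for the codec generator and its discriminators.
generator = torch.nn.Linear(1, 1)
discriminator = torch.nn.Linear(1, 1)

opt_g = torch.optim.AdamW(generator.parameters(), lr=LEARNING_RATE, betas=BETAS)
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=LEARNING_RATE, betas=BETAS)

# Exponential decay applied after every optimizer step.
sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=LR_DECAY_GAMMA)
sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=LR_DECAY_GAMMA)

for step in range(NUM_ITERATIONS):
    # ... forward pass, loss computation, and .backward() calls go here ...
    opt_d.step(); sched_d.step(); opt_d.zero_grad()
    opt_g.step(); sched_g.step(); opt_g.zero_grad()
```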