High-Fidelity Audio Compression with Improved RVQGAN

Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice.
Researcher Affiliation | Industry | Rithesh Kumar*, Prem Seetharaman*, Alejandro Luebs, Ishaan Kumar, Kundan Kumar (all Descript, Inc.)
Pseudocode | No | Appendix A provides mathematical equations for a modified codebook learning algorithm, but not structured pseudocode or an algorithm block. (An illustrative sketch follows this table.)
Open Source Code | Yes | We provide code [1], models, and audio samples [2] that we encourage the reader to listen to. [1] https://github.com/descriptinc/descript-audio-codec
Open Datasets | Yes | We train our model on a large dataset compiled of speech, music, and environmental sounds. For speech, we use the DAPS dataset [26], the clean speech segments from DNS Challenge 4 [10], the Common Voice dataset [2], and the VCTK dataset [40]. For music, we use the MUSDB dataset [31], and the Jamendo dataset [4]. Finally, for environmental sound, we use both the balanced and unbalanced train segments from Audio Set [14].
Dataset Splits | No | The paper does not explicitly describe a validation dataset split for hyperparameter tuning or model selection. It mentions training data and test data.
Hardware Specification | No | The paper mentions training on 'a single GPU' but does not provide specific details such as the model, memory, or manufacturer of the GPU.
Software Dependencies | No | The paper mentions the AdamW optimizer [23] and 'Python 3.8', but it does not specify version numbers for any key software libraries, frameworks, or solvers beyond the programming language version.
Experiment Setup | Yes | For our ablation study, we train each model with a batch size of 12 for 250k iterations. For our final model, we train with a batch size of 72 for 400k iterations. We train with excerpts of duration 0.38s. We use the AdamW optimizer [23] with a learning rate of 1e-4, β1 = 0.8, and β2 = 0.9, for both the generator and the discriminator. We decay the learning rate at every step, with γ = 0.999996. (A configuration sketch follows this table.)
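The Pseudocode row notes that Appendix A of the paper gives equations rather than an algorithm block for its modified codebook learning. As a rough, non-authoritative illustration, the sketch below shows a factorized, L2-normalized codebook lookup of the kind the paper describes. The function name, projection layers, shapes, and variable names are assumptions made for illustration, not the authors' implementation, and the commitment/codebook losses and straight-through gradient are omitted.

```python
import torch
import torch.nn.functional as F


def codebook_lookup(latents, codebook, in_proj, out_proj):
    """Nearest-neighbor lookup over a factorized, L2-normalized codebook.

    latents:  (batch, time, latent_dim) encoder output
    codebook: (codebook_size, code_dim) learnable code vectors
    in_proj / out_proj: linear maps between latent_dim and the small code_dim
    """
    # Project into the low-dimensional code space and L2-normalize, so the
    # nearest-neighbor search reduces to a cosine-similarity search.
    z = F.normalize(in_proj(latents), dim=-1)       # (B, T, code_dim)
    codes = F.normalize(codebook, dim=-1)           # (K, code_dim)

    # For unit vectors, the closest code by Euclidean distance is the one
    # with the largest dot product.
    indices = torch.argmax(z @ codes.t(), dim=-1)   # (B, T)

    # Fetch the selected codes and project back to the latent dimension.
    z_q = out_proj(codes[indices])                  # (B, T, latent_dim)
    return z_q, indices


# Hypothetical shapes: 1024-d latents, a 1024-entry codebook of 8-d codes.
in_proj = torch.nn.Linear(1024, 8)
out_proj = torch.nn.Linear(8, 1024)
codebook = torch.nn.Parameter(torch.randn(1024, 8))
z_q, idx = codebook_lookup(torch.randn(2, 50, 1024), codebook, in_proj, out_proj)
```

In a residual vector quantizer, a lookup of this kind would be applied at each quantization stage to the residual left by the previous stage; only the single-stage lookup is shown here.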
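The Experiment Setup row is essentially an optimizer and scheduler configuration. Below is a minimal PyTorch sketch of that configuration, assuming `ExponentialLR` as the per-step decay; the `generator` and `discriminator` modules are placeholders, and only the hyperparameters quoted in the row above come from the paper.

```python
import torch
import torch.nn as nn

# Placeholder modules: the real generator/discriminator architectures are
# defined in the paper and its released code, not reproduced here.
generator = nn.Linear(1, 1)
discriminator = nn.Linear(1, 1)

# AdamW with lr = 1e-4 and betas = (0.8, 0.9) for both networks,
# as quoted in the Experiment Setup row.
opt_g = torch.optim.AdamW(generator.parameters(), lr=1e-4, betas=(0.8, 0.9))
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=1e-4, betas=(0.8, 0.9))

# The learning rate decays at every step with gamma = 0.999996.
sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.999996)
sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=0.999996)

# Ablation runs: batch size 12 for 250k iterations on 0.38 s excerpts;
# the final model uses batch size 72 for 400k iterations.
NUM_ITERATIONS = 250_000
BATCH_SIZE = 12
EXCERPT_SECONDS = 0.38
```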