High-Fidelity Audio Compression with Improved RVQGAN

Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice.
Researcher Affiliation | Industry | Rithesh Kumar*, Prem Seetharaman*, Alejandro Luebs, Ishaan Kumar, Kundan Kumar (all Descript, Inc.)
Pseudocode | No | Appendix A provides mathematical equations for a modified codebook learning algorithm, but not structured pseudocode or an algorithm block. (An illustrative sketch follows this table.)
Open Source Code | Yes | We provide code [1], models, and audio samples [2] that we encourage the reader to listen to. [1] https://github.com/descriptinc/descript-audio-codec
Open Datasets | Yes | We train our model on a large dataset compiled of speech, music, and environmental sounds. For speech, we use the DAPS dataset [26], the clean speech segments from DNS Challenge 4 [10], the Common Voice dataset [2], and the VCTK dataset [40]. For music, we use the MUSDB dataset [31], and the Jamendo dataset [4]. Finally, for environmental sound, we use both the balanced and unbalanced train segments from Audio Set [14].
Dataset Splits | No | The paper does not explicitly describe a validation dataset split for hyperparameter tuning or model selection. It mentions training data and test data.
Hardware Specification | No | The paper mentions training on 'a single GPU' but does not provide specific details such as the model, memory, or manufacturer of the GPU.
Software Dependencies | No | The paper mentions the AdamW optimizer [23] and 'Python 3.8', but it does not specify version numbers for any key software libraries, frameworks, or solvers beyond the programming language version.
Experiment Setup | Yes | For our ablation study, we train each model with a batch size of 12 for 250k iterations. For our final model, we train with a batch size of 72 for 400k iterations. We train with excerpts of duration 0.38s. We use the AdamW optimizer [23] with a learning rate of 1e-4, β1 = 0.8, and β2 = 0.9, for both the generator and the discriminator. We decay the learning rate at every step, with γ = 0.999996. (A configuration sketch follows this table.)
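The Pseudocode row notes that Appendix A of the paper gives equations rather than an algorithm block for its modified codebook learning. As a rough, non-authoritative illustration, the sketch below shows a factorized, L2-normalized codebook lookup of the kind the paper describes. The function name, projection layers, shapes, and variable names are assumptions made for illustration, not the authors' implementation, and the commitment/codebook losses and straight-through gradient are omitted.

```python
import torch
import torch.nn.functional as F


def codebook_lookup(latents, codebook, in_proj, out_proj):
    """Nearest-neighbor lookup over a factorized, L2-normalized codebook.

    latents:  (batch, time, latent_dim) encoder output
    codebook: (codebook_size, code_dim) learnable code vectors
    in_proj / out_proj: linear maps between latent_dim and the small code_dim
    """
    # Project into the low-dimensional code space and L2-normalize, so the
    # nearest-neighbor search reduces to a cosine-similarity search.
    z = F.normalize(in_proj(latents), dim=-1)       # (B, T, code_dim)
    codes = F.normalize(codebook, dim=-1)           # (K, code_dim)

    # For unit vectors, the closest code by Euclidean distance is the one
    # with the largest dot product.
    indices = torch.argmax(z @ codes.t(), dim=-1)   # (B, T)

    # Fetch the selected codes and project back to the latent dimension.
    z_q = out_proj(codes[indices])                  # (B, T, latent_dim)
    return z_q, indices


# Hypothetical shapes: 1024-d latents, a 1024-entry codebook of 8-d codes.
in_proj = torch.nn.Linear(1024, 8)
out_proj = torch.nn.Linear(8, 1024)
codebook = torch.nn.Parameter(torch.randn(1024, 8))
z_q, idx = codebook_lookup(torch.randn(2, 50, 1024), codebook, in_proj, out_proj)
```

In a residual vector quantizer, a lookup of this kind would be applied at each quantization stage to the residual left by the previous stage; only the single-stage lookup is shown here.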
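The Experiment Setup row is essentially an optimizer and scheduler configuration. Below is a minimal PyTorch sketch of that configuration, assuming `ExponentialLR` as the per-step decay; the `generator` and `discriminator` modules are placeholders, and only the hyperparameters quoted in the row above come from the paper.

```python
import torch
import torch.nn as nn

# Placeholder modules: the real generator/discriminator architectures are
# defined in the paper and its released code, not reproduced here.
generator = nn.Linear(1, 1)
discriminator = nn.Linear(1, 1)

# AdamW with lr = 1e-4 and betas = (0.8, 0.9) for both networks,
# as quoted in the Experiment Setup row.
opt_g = torch.optim.AdamW(generator.parameters(), lr=1e-4, betas=(0.8, 0.9))
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=1e-4, betas=(0.8, 0.9))

# The learning rate decays at every step with gamma = 0.999996.
sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=0.999996)
sched_d = torch.optim.lr_scheduler.ExponentialLR(opt_d, gamma=0.999996)

# Ablation runs: batch size 12 for 250k iterations on 0.38 s excerpts;
# the final model uses batch size 72 for 400k iterations.
NUM_ITERATIONS = 250_000
BATCH_SIZE = 12
EXCERPT_SECONDS = 0.38
```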