Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
High-Fidelity Audio Compression with Improved RVQGAN
Authors: Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice |
| Researcher Affiliation | Industry | Rithesh Kumar* Descript, Inc. Prem Seetharaman* Descript, Inc. Alejandro Luebs Descript, Inc. Ishaan Kumar Descript, Inc. Kundan Kumar Descript, Inc. |
| Pseudocode | No | Appendix A provides mathematical equations for a modified codebook learning algorithm, but not structured pseudocode or an algorithm block. |
| Open Source Code | Yes | We provide code¹, models, and audio samples² that we encourage the reader to listen to. ¹ https://github.com/descriptinc/descript-audio-codec |
| Open Datasets | Yes | We train our model on a large dataset compiled of speech, music, and environmental sounds. For speech, we use the DAPS dataset [26], the clean speech segments from DNS Challenge 4 [10], the Common Voice dataset [2], and the VCTK dataset [40]. For music, we use the MUSDB dataset [31], and the Jamendo dataset [4]. Finally, for environmental sound, we use both the balanced and unbalanced train segments from AudioSet [14]. |
| Dataset Splits | No | The paper does not explicitly describe a validation dataset split for hyperparameter tuning or model selection. It mentions training data and test data. |
| Hardware Specification | No | The paper mentions training on 'a single GPU' but does not provide specific details such as the model, memory, or manufacturer of the GPU. |
| Software Dependencies | No | The paper mentions the 'AdamW optimizer [23]' and 'Python 3.8', but it does not specify version numbers for any key software libraries, frameworks, or solvers beyond the programming language version. |
| Experiment Setup | Yes | For our ablation study, we train each model with a batch size of 12 for 250k iterations. For our final model, we train with a batch size of 72 for 400k iterations. We train with excerpts of duration 0.38s. We use the AdamW optimizer [23] with a learning rate of 1e-4, β1 = 0.8, and β2 = 0.9, for both the generator and the discriminator. We decay the learning rate at every step, with γ = 0.999996. |
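
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a small calculation. The sketch below assumes that "decay the learning rate at every step, with γ = 0.999996" means a multiplicative (exponential) decay applied once per optimizer step; the quoted text does not confirm this, and the function and constant names are hypothetical, not taken from the authors' code.

```python
# Minimal sketch of a per-step exponential learning-rate decay.
# Assumption: lr(step) = base_lr * gamma ** step, using the values quoted
# in the table (base_lr = 1e-4, gamma = 0.999996). Names are illustrative.

BASE_LR = 1e-4
GAMMA = 0.999996


def lr_at_step(step: int, base_lr: float = BASE_LR, gamma: float = GAMMA) -> float:
    """Return the learning rate after `step` optimizer steps of decay."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return base_lr * gamma ** step


if __name__ == "__main__":
    # Iteration counts quoted for the ablation runs and the final model.
    for label, steps in (("ablation", 250_000), ("final", 400_000)):
        print(f"{label}: lr after {steps} steps = {lr_at_step(steps):.3e}")
```

Under this assumption, the learning rate falls to roughly 2e-5 by the end of the 400k-iteration run, about a fifth of its initial value.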