BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Authors: Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive evaluation of BigVGAN for both in-distribution and out-of-distribution scenarios. We train BigVGAN and all baseline models on the full LibriTTS dataset. ... We report the performance of BigVGAN and the baseline models evaluated on LibriTTS using the above objective and subjective metrics. ... Table 2 shows the in-distribution test results on LibriTTS. ... Table 3 summarizes the SMOS results from three different types of unseen dataset. ... Table 4 shows the SMOS results from the 5 tracks and their average from the MUSDB18-HQ test set. ... Table 5: Ablation results on training data diversity using the 112M BigVGAN model, evaluated on LibriTTS.
Researcher Affiliation | Collaboration | Sang-gil Lee¹, Wei Ping², Boris Ginsburg², Bryan Catanzaro², Sungroh Yoon¹,³ — ¹Data Science & AI Lab, Seoul National University (SNU); ²NVIDIA; ³AIIS, ASRI, INMC, ISRC, NSI, and Interdisciplinary Program in AI, SNU
Pseudocode | No | The paper provides detailed architectural diagrams and textual descriptions of the model components but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code and model at: https://github.com/NVIDIA/BigVGAN.
Open Datasets | Yes | We use the LibriTTS (Zen et al., 2019) dataset with the original sampling rate of 24 kHz for training.
Dataset Splits | Yes | We perform objective evaluations on dev-clean and dev-other altogether, and conduct subjective evaluations on the combined test-clean and test-other.
Hardware Specification | Yes | Table 1: Model footprint and synthesis speed for 24 kHz audio measured on an NVIDIA RTX 8000 GPU.
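Synthesis-speed figures like those in Table 1 are typically reported as a real-time factor (generated audio duration divided by wall-clock synthesis time). A minimal sketch of such a measurement, using a stand-in dummy vocoder rather than the paper's actual model (all names and the `hop_length` value are illustrative assumptions, not the released code):

```python
import time

SAMPLE_RATE = 24_000  # 24 kHz output, as in the paper


def dummy_vocoder(num_frames, hop_length=256):
    # Stand-in for a neural vocoder: pretend each mel frame
    # yields hop_length waveform samples. hop_length is illustrative.
    return [0.0] * (num_frames * hop_length)


def measure_rtf(num_frames=400, hop_length=256):
    """Real-time factor = audio seconds generated / wall-clock seconds.

    RTF > 1 means synthesis is faster than real time.
    """
    start = time.perf_counter()
    audio = dummy_vocoder(num_frames, hop_length)
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / SAMPLE_RATE
    return audio_seconds / elapsed


if __name__ == "__main__":
    print(f"RTF: {measure_rtf():.1f}x real time")
```

With a real model, the timing window would wrap the GPU forward pass (plus a synchronization point) instead of the list construction above.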
Software Dependencies | No | No specific version numbers are provided for the software dependencies mentioned, such as 'librosa', 'Auraloss', 'python-pesq', 'python-MCD', the 'CARGAN' code, or the 'NVIDIA NeMo' toolkit.
Experiment Setup | Yes | We train all BigVGAN models, including the ablation models and the baseline HiFi-GAN, using our training configuration for 1M steps. We use a batch size of 32, a segment size of 8,192, and an initial learning rate of 1×10⁻⁴. ... Refer to Table 6 in Appendix A for detailed hyperparameters.
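The quoted setup can be collected into a small configuration sketch. The values come from the text above; the dict structure, field names, and the helper function are illustrative assumptions, not the released BigVGAN config:

```python
# Hyperparameters quoted in the row above; names are illustrative.
TRAIN_CONFIG = {
    "total_steps": 1_000_000,  # "for 1M steps"
    "batch_size": 32,
    "segment_size": 8_192,     # waveform samples per training segment
    "initial_lr": 1e-4,        # initial learning rate 1x10^-4
}


def segments_seen(step, batch_size=TRAIN_CONFIG["batch_size"]):
    # Total training segments consumed after `step` optimizer steps.
    return step * batch_size


if __name__ == "__main__":
    # A full 1M-step run at batch size 32 consumes 32M segments.
    print(segments_seen(TRAIN_CONFIG["total_steps"]))
```

Remaining hyperparameters (optimizer, schedules, loss weights) are deferred to Table 6 in the paper's Appendix A.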