Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Authors: Jaehyeon Kim, Jungil Kong, Juhee Son

ICML 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on two different datasets. We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Researcher Affiliation | Collaboration | (1) Kakao Enterprise, Seongnam-si, Gyeonggi-do, Republic of Korea; (2) School of Computing, KAIST, Daejeon, Republic of Korea. |
| Pseudocode | Yes | Appendix A includes pseudocode for MAS. |
| Open Source Code | Yes | We make both our demo page and source code publicly available. Source code: https://github.com/jaywalnut310/vits Demo: https://jaywalnut310.github.io/vits-demo/index.html |
| Open Datasets | Yes | We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Dataset Splits | Yes | We randomly split the dataset into a training set (12,500 samples), validation set (100 samples), and test set (500 samples). The VCTK dataset ... We randomly split the dataset into a training set (43,470 samples), validation set (100 samples), and test set (500 samples). |
| Hardware Specification | Yes | We use mixed precision training on 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer (Loshchilov & Hutter, 2019)" and "open-source software (Bernard, 2021)" for IPA phoneme sequences, but it does not provide version numbers for general software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | The networks are trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.8, β2 = 0.99 and weight decay λ = 0.01. The learning rate decay is scheduled by a 0.999^(1/8) factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size is set to 64 per GPU and the model is trained up to 800k steps. |
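
The "Dataset Splits" row describes fixed-size random splits rather than a published split file. A minimal sketch of how such a split could be reproduced for LJ Speech (13,100 clips total, matching 12,500 + 100 + 500) with PyTorch's `random_split`; the seed and the `split_ljspeech` helper are assumptions for illustration, not the authors' code:

```python
import torch
from torch.utils.data import random_split

SEED = 1234  # assumption: the paper does not state the seed used for splitting


def split_ljspeech(dataset):
    """Randomly split the 13,100-utterance LJ Speech dataset into the
    sizes quoted above: 12,500 train / 100 validation / 500 test."""
    generator = torch.Generator().manual_seed(SEED)
    train_set, val_set, test_set = random_split(
        dataset, [12500, 100, 500], generator=generator
    )
    return train_set, val_set, test_set
```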
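The "Hardware Specification" and "Experiment Setup" rows map onto standard PyTorch training primitives (the released source code is PyTorch-based). The sketch below only encodes the hyperparameters quoted above: AdamW with β1 = 0.8, β2 = 0.99, weight decay 0.01, initial learning rate 2 × 10⁻⁴, a per-epoch decay factor of 0.999^(1/8), and mixed precision. The `model` and the MSE loss are placeholders, not the paper's adversarial VITS objective.

```python
import torch

# Placeholder network; the real VITS training loop optimizes generator and
# discriminator networks with adversarial and variational losses.
model = torch.nn.Linear(80, 80)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.8, 0.99), weight_decay=0.01
)
# Learning rate decays by a factor of 0.999 ** (1 / 8) once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999 ** (1 / 8))
scaler = torch.cuda.amp.GradScaler()  # mixed precision, as stated in the paper


def train_epoch(loader):
    for batch, target in loader:  # batch size 64 per GPU in the paper's setup
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.mse_loss(model(batch), target)  # placeholder loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()  # per-epoch decay, as described above
```

A run on the 4 × V100 setup described in the table would typically also wrap the model in `torch.nn.parallel.DistributedDataParallel`; the parallelism strategy is not spelled out in the quoted text, so it is omitted here.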