Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
Authors: Jaehyeon Kim, Jungil Kong, Juhee Son
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on two different datasets. We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Researcher Affiliation | Collaboration | (1) Kakao Enterprise, Seongnam-si, Gyeonggi-do, Republic of Korea; (2) School of Computing, KAIST, Daejeon, Republic of Korea. |
| Pseudocode | Yes | Appendix A includes pseudocode for MAS (a hedged sketch of the underlying dynamic program is given after this table). |
| Open Source Code | Yes | We make both our demo page and source code publicly available. Source code: https://github.com/jaywalnut310/vits Demo: https://jaywalnut310.github.io/vits-demo/index.html |
| Open Datasets | Yes | We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Dataset Splits | Yes | We randomly split the dataset into a training set (12,500 samples), validation set (100 samples), and test set (500 samples). The VCTK dataset... We randomly split the dataset into a training set (43,470 samples), validation set (100 samples), and test set (500 samples). (A sketch of such a random split follows the table.) |
| Hardware Specification | Yes | We use mixed precision training on 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer (Loshchilov & Hutter, 2019)" and "open-source software (Bernard, 2021)" for IPA phoneme sequences. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | The networks are trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.8, β2 = 0.99 and weight decay λ = 0.01. The learning rate decay is scheduled by a 0.999^(1/8) factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size is set to 64 per GPU and the model is trained up to 800k steps. (A sketch of this setup is shown after the table.) |
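On the pseudocode row: the authoritative MAS pseudocode is in Appendix A of the paper. The following is only a minimal NumPy sketch of the monotonic-alignment dynamic program it describes, assuming a precomputed `log_likelihood` matrix; the function name and interface are illustrative, not the authors' implementation.

```python
import numpy as np

def monotonic_alignment_search(log_likelihood: np.ndarray) -> np.ndarray:
    """Sketch of MAS: find the monotonic, surjective text-to-frame alignment
    that maximizes the summed log-likelihood.

    log_likelihood: [T_text, T_mel] matrix of per-cell log-likelihoods.
    Returns a hard 0/1 alignment matrix of the same shape.
    """
    T_text, T_mel = log_likelihood.shape
    Q = np.full((T_text, T_mel), -np.inf)

    # Forward pass: best cumulative log-likelihood of a monotonic path ending at (i, j).
    Q[0, 0] = log_likelihood[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                                # frame j stays on token i
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf   # frame j advances to token i
            Q[i, j] = log_likelihood[i, j] + max(stay, advance)

    # Backtracking: recover the optimal monotonic path as a hard alignment.
    alignment = np.zeros_like(log_likelihood)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        alignment[i, j] = 1.0
        if i > 0 and (i == j or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    return alignment
```

The forward pass keeps, for each (token i, frame j) cell, the best cumulative score of any monotonic path ending there; backtracking from the final cell then recovers the hard alignment.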
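On the dataset-splits row: the quoted sizes (12,500/100/500 for LJ Speech, 43,470/100/500 for VCTK) describe random partitions, but the exact file lists are not reproduced here. The helper below is a hypothetical reconstruction of such a split, not the authors' actual one; the seed and function name are assumptions.

```python
import random

def random_split(file_list, n_train=12500, n_val=100, n_test=500, seed=0):
    """Randomly partition a list of clips into train/val/test of the given sizes."""
    assert len(file_list) >= n_train + n_val + n_test
    rng = random.Random(seed)
    shuffled = list(file_list)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Example with placeholder clip names (LJ Speech has 13,100 clips in total):
train, val, test = random_split([f"clip_{i:05d}" for i in range(13100)])
```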
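On the experiment-setup row: as a sketch of how the quoted optimizer and learning-rate schedule translate into code, assuming a PyTorch implementation (the `torch.nn.Linear` module and the toy training loop are placeholders, not the VITS networks):

```python
import torch

model = torch.nn.Linear(80, 80)  # placeholder module, not the VITS networks

# AdamW with beta1 = 0.8, beta2 = 0.99, weight decay 0.01, initial LR 2e-4.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.8, 0.99),
    weight_decay=0.01,
)

# Learning-rate decay by a 0.999^(1/8) factor every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999 ** (1 / 8))

for epoch in range(3):  # in the paper: batch size 64 per GPU, trained up to 800k steps
    x = torch.randn(64, 80)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # decay once per epoch
    print(epoch, scheduler.get_last_lr())
```

`ExponentialLR` with `gamma=0.999 ** (1 / 8)`, stepped once per epoch, reproduces the stated per-epoch decay factor.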