Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Authors: Jaehyeon Kim, Jungil Kong, Juhee Son

ICML 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on two different datasets. We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Researcher Affiliation | Collaboration | (1) Kakao Enterprise, Seongnam-si, Gyeonggi-do, Republic of Korea; (2) School of Computing, KAIST, Daejeon, Republic of Korea. |
| Pseudocode | Yes | Appendix A includes pseudocode for MAS. |
| Open Source Code | Yes | We make both our demo page and source code publicly available. Source code: https://github.com/jaywalnut310/vits Demo: https://jaywalnut310.github.io/vits-demo/index.html |
| Open Datasets | Yes | We used the LJ Speech dataset (Ito, 2017) for comparison with other publicly available models and the VCTK dataset (Veaux et al., 2017) to verify whether our model can learn and express diverse speech characteristics. |
| Dataset Splits | Yes | We randomly split the dataset into a training set (12,500 samples), validation set (100 samples), and test set (500 samples). The VCTK dataset ... We randomly split the dataset into a training set (43,470 samples), validation set (100 samples), and test set (500 samples). |
| Hardware Specification | Yes | We use mixed precision training on 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer (Loshchilov & Hutter, 2019)" and "open-source software (Bernard, 2021)" for IPA phoneme sequences, but it does not provide version numbers for general software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | The networks are trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.8, β2 = 0.99 and weight decay λ = 0.01. The learning rate decay is scheduled by a 0.999^(1/8) factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size is set to 64 per GPU and the model is trained up to 800k steps. |
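
The "Dataset Splits" row describes fixed-size random splits rather than a published split file. A minimal sketch of how such a split could be reproduced for LJ Speech (13,100 clips total, matching 12,500 + 100 + 500) with PyTorch's `random_split`; the seed and the `split_ljspeech` helper are assumptions for illustration, not the authors' code:

```python
import torch
from torch.utils.data import random_split

SEED = 1234  # assumption: the paper does not state the seed used for splitting


def split_ljspeech(dataset):
    """Randomly split the 13,100-utterance LJ Speech dataset into the
    sizes quoted above: 12,500 train / 100 validation / 500 test."""
    generator = torch.Generator().manual_seed(SEED)
    train_set, val_set, test_set = random_split(
        dataset, [12500, 100, 500], generator=generator
    )
    return train_set, val_set, test_set
```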
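The "Hardware Specification" and "Experiment Setup" rows map onto standard PyTorch training primitives (the released source code is PyTorch-based). The sketch below only encodes the hyperparameters quoted above: AdamW with β1 = 0.8, β2 = 0.99, weight decay 0.01, initial learning rate 2 × 10⁻⁴, a per-epoch decay factor of 0.999^(1/8), and mixed precision. The `model` and the MSE loss are placeholders, not the paper's adversarial VITS objective.

```python
import torch

# Placeholder network; the real VITS training loop optimizes generator and
# discriminator networks with adversarial and variational losses.
model = torch.nn.Linear(80, 80)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.8, 0.99), weight_decay=0.01
)
# Learning rate decays by a factor of 0.999 ** (1 / 8) once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999 ** (1 / 8))
scaler = torch.cuda.amp.GradScaler()  # mixed precision, as stated in the paper


def train_epoch(loader):
    for batch, target in loader:  # batch size 64 per GPU in the paper's setup
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.mse_loss(model(batch), target)  # placeholder loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()  # per-epoch decay, as described above
```

A run on the 4 × V100 setup described in the table would typically also wrap the model in `torch.nn.parallel.DistributedDataParallel`; the parallelism strategy is not spelled out in the quoted text, so it is omitted here.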