Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech

Authors: Yoonhyung Lee, Joongbo Shin, Kyomin Jung

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In experiments conducted on LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality." |
| Researcher Affiliation | Academia | "Yoonhyung Lee, Joongbo Shin, Kyomin Jung. Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea. {cpi1234, jbshin, kjung}@snu.ac.kr" |
| Pseudocode | Yes | "In Appendix A.1, pseudo-codes for the training and inference of BVAE-TTS are contained for detailed descriptions." |
| Open Source Code | Yes | https://github.com/LEEYOONHYUNG/BVAE-TTS |
| Open Datasets | Yes | "In the experiments, we mainly use the LJSpeech dataset (Ito & Johnson, 2017) consisting of 12500 / 100 / 500 samples for training / validation / test splits, respectively." |
| Dataset Splits | Yes | "In the experiments, we mainly use the LJSpeech dataset (Ito & Johnson, 2017) consisting of 12500 / 100 / 500 samples for training / validation / test splits, respectively." |
| Hardware Specification | Yes | "Training of BVAE-TTS takes about 48 hours on Intel(R) Xeon(R) Gold 5120 CPU (2.2GHz) and NVIDIA V100 GPU on the Pytorch 1.16.0 library with Python 3.6.10 over the Ubuntu 16.04 LTS." |
| Software Dependencies | Yes | "Training of BVAE-TTS takes about 48 hours on Intel(R) Xeon(R) Gold 5120 CPU (2.2GHz) and NVIDIA V100 GPU on the Pytorch 1.16.0 library with Python 3.6.10 over the Ubuntu 16.04 LTS." |
| Experiment Setup | Yes | "We train the BVAE-TTS consisting of 4 BVAE blocks for 300K iterations with a batch size of 128. For an optimizer, we use the Adamax Optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999 using the learning rate scheduling used in (Vaswani et al., 2017), where initial learning rate of 1e-3 and warm-up step of 4000 are used." |
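The experiment-setup quote combines Adamax with the warm-up learning-rate schedule of Vaswani et al. (2017). As a minimal sketch of how that schedule behaves under the reported hyperparameters (initial learning rate 1e-3, warm-up step 4000), the pure-Python function below implements a common rescaled variant in which the rate rises linearly to its peak at the warm-up step and then decays as the inverse square root of the step count. The function name `noam_lr` and the exact rescaling are assumptions for illustration, not taken from the paper's code.

```python
def noam_lr(step: int, init_lr: float = 1e-3, warmup: int = 4000) -> float:
    """Transformer-style warm-up schedule (Vaswani et al., 2017), rescaled
    (an assumption) so the peak learning rate at step == warmup is init_lr.

    lr(step) = init_lr * warmup**0.5 * min(step**-0.5, step * warmup**-1.5)
    """
    step = max(step, 1)  # guard against step 0 in the negative powers
    return init_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

In a PyTorch training loop this kind of function is typically attached to the optimizer (here `torch.optim.Adamax` with betas=(0.9, 0.999), matching the quoted setup) via `torch.optim.lr_scheduler.LambdaLR`, with the lambda returning `noam_lr(step) / init_lr` as a multiplicative factor.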