Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech

Authors: Yoonhyung Lee, Joongbo Shin, Kyomin Jung

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In experiments conducted on LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality." |
| Researcher Affiliation | Academia | "Yoonhyung Lee, Joongbo Shin, Kyomin Jung. Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea. {cpi1234, jbshin, kjung}@snu.ac.kr" |
| Pseudocode | Yes | "In Appendix A.1, pseudo-codes for the training and inference of BVAE-TTS are contained for detailed descriptions." |
| Open Source Code | Yes | https://github.com/LEEYOONHYUNG/BVAE-TTS |
| Open Datasets | Yes | "In the experiments, we mainly use the LJSpeech dataset (Ito & Johnson, 2017) consisting of 12500 / 100 / 500 samples for training / validation / test splits, respectively." |
| Dataset Splits | Yes | "In the experiments, we mainly use the LJSpeech dataset (Ito & Johnson, 2017) consisting of 12500 / 100 / 500 samples for training / validation / test splits, respectively." |
| Hardware Specification | Yes | "Training of BVAE-TTS takes about 48 hours on Intel(R) Xeon(R) Gold 5120 CPU (2.2GHz) and NVIDIA V100 GPU on the Pytorch 1.16.0 library with Python 3.6.10 over the Ubuntu 16.04 LTS." |
| Software Dependencies | Yes | "Training of BVAE-TTS takes about 48 hours on Intel(R) Xeon(R) Gold 5120 CPU (2.2GHz) and NVIDIA V100 GPU on the Pytorch 1.16.0 library with Python 3.6.10 over the Ubuntu 16.04 LTS." |
| Experiment Setup | Yes | "We train the BVAE-TTS consisting of 4 BVAE blocks for 300K iterations with a batch size of 128. For an optimizer, we use the Adamax Optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999 using the learning rate scheduling used in (Vaswani et al., 2017), where initial learning rate of 1e-3 and warm-up step of 4000 are used." |
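The experiment-setup quote combines Adamax with the warm-up learning-rate schedule of Vaswani et al. (2017). As a minimal sketch of how that schedule behaves under the reported hyperparameters (initial learning rate 1e-3, warm-up step 4000), the pure-Python function below implements a common rescaled variant in which the rate rises linearly to its peak at the warm-up step and then decays as the inverse square root of the step count. The function name `noam_lr` and the exact rescaling are assumptions for illustration, not taken from the paper's code.

```python
def noam_lr(step: int, init_lr: float = 1e-3, warmup: int = 4000) -> float:
    """Transformer-style warm-up schedule (Vaswani et al., 2017), rescaled
    (an assumption) so the peak learning rate at step == warmup is init_lr.

    lr(step) = init_lr * warmup**0.5 * min(step**-0.5, step * warmup**-1.5)
    """
    step = max(step, 1)  # guard against step 0 in the negative powers
    return init_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

In a PyTorch training loop this kind of function is typically attached to the optimizer (here `torch.optim.Adamax` with betas=(0.9, 0.999), matching the quoted setup) via `torch.optim.lr_scheduler.LambdaLR`, with the lambda returning `noam_lr(step) / init_lr` as a multiplicative factor.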