Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Authors: Rafael Valle, Kevin J. Shih, Ryan Prenger, Bryan Catanzaro

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section describes our training setup and provides quantitative and qualitative results. Our quantitative results show that Flowtron has mean opinion scores that are comparable to the state of the art. Our qualitative results demonstrate many features that are either impossible or inefficient to achieve using Tacotron, Tacotron 2 GST and Tacotron GM-VAE.
Researcher Affiliation | Industry | Rafael Valle, Kevin J. Shih, Ryan Prenger & Bryan Catanzaro (NVIDIA); rafaelvalle@nvidia.com
Pseudocode | Yes | Algorithm 1: Flowtron posterior inference
Open Source Code | Yes | Code and pre-trained models are publicly available at https://github.com/NVIDIA/flowtron.
Open Datasets | Yes | We train Flowtron, Tacotron 2 and Tacotron 2 GST models using a dataset (LSH) that combines the LJSpeech dataset (Ito et al., 2017) with two proprietary single-speaker datasets with 20 and 10 hours each (Sally and Helen). We also train a Flowtron model on the train-clean-100 subset of LibriTTS (Zen et al., 2019) with 123 speakers and 25 minutes on average per speaker.
Dataset Splits | Yes | For each dataset, we use at least 180 samples for the validation set, and the remainder for the training set.
Hardware Specification | Yes | Each model is trained on a single NVIDIA DGX-1 with 8 GPUs.
Software Dependencies | No | The paper mentions software such as WaveGlow and the ADAM optimizer but does not provide specific version numbers for these or other key software components, which are required for reproducibility.
Experiment Setup | Yes | We use the ADAM (Kingma & Ba, 2014) optimizer with default parameters, a 1e-4 learning rate and 1e-6 weight decay for Flowtron, and a 1e-3 learning rate and 1e-5 weight decay for the other models, following Wang et al. (2017). Flowtron models with 2 steps of flow were trained on the LSH dataset for approximately 1000 epochs, then fine-tuned on LibriTTS for 500 epochs. Tacotron 2 and Tacotron 2 GST are trained for approximately 500 epochs.
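The per-model optimizer settings quoted in the Experiment Setup row above can be sketched as follows. This is a minimal illustration, not code from the paper's repository: the `OPTIMIZER_CONFIGS` table and the `make_optimizer` helper are hypothetical names, and the helper assumes PyTorch (which the official Flowtron repository uses) is installed.

```python
# Hyperparameters quoted from the Experiment Setup row above.
# ADAM "default parameters" (betas=(0.9, 0.999), eps=1e-8) follow
# Kingma & Ba (2014); only lr and weight decay differ per model.
OPTIMIZER_CONFIGS = {
    "flowtron":      {"lr": 1e-4, "weight_decay": 1e-6},
    "tacotron2":     {"lr": 1e-3, "weight_decay": 1e-5},
    "tacotron2_gst": {"lr": 1e-3, "weight_decay": 1e-5},
}

def make_optimizer(model_name, parameters):
    """Build an ADAM optimizer for the given model's parameters.

    Hypothetical helper; assumes PyTorch is available. `parameters`
    would typically be `model.parameters()`.
    """
    import torch  # deferred so the config table is usable without PyTorch

    cfg = OPTIMIZER_CONFIGS[model_name]
    return torch.optim.Adam(
        parameters, lr=cfg["lr"], weight_decay=cfg["weight_decay"]
    )
```

Keeping the hyperparameters in a single lookup table makes the paper's Flowtron-vs-baseline distinction (1e-4/1e-6 versus 1e-3/1e-5) explicit in one place.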