Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Authors: Rafael Valle, Kevin J. Shih, Ryan Prenger, Bryan Catanzaro
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section describes our training setup and provides quantitative and qualitative results. Our quantitative results show that Flowtron has mean opinion scores that are comparable to the state of the art. Our qualitative results demonstrate many features that are either impossible or inefficient to achieve using Tacotron, Tacotron 2 GST and Tacotron GM-VAE. |
| Researcher Affiliation | Industry | Rafael Valle, Kevin J. Shih, Ryan Prenger & Bryan Catanzaro NVIDIA rafaelvalle@nvidia.com |
| Pseudocode | Yes | Algorithm 1: Flowtron Posterior inference (a hedged sketch follows the table) |
| Open Source Code | Yes | Code and pre-trained models are publicly available at https://github.com/NVIDIA/flowtron. |
| Open Datasets | Yes | We train Flowtron, Tacotron 2 and Tacotron 2 GST models using a dataset (LSH) that combines the LJSpeech dataset (Ito et al., 2017) with two proprietary single speaker datasets with 20 and 10 hours each (Sally and Helen). We also train a Flowtron model on the train-clean-100 subset of LibriTTS (Zen et al., 2019) with 123 speakers and 25 minutes on average per speaker. |
| Dataset Splits | Yes | For each dataset, we use at least 180 samples for the validation set, and the remainder for the training set. (A sketch of this split follows the table.) |
| Hardware Specification | Yes | Each model is trained on a single NVIDIA DGX-1 with 8 GPUs. |
| Software Dependencies | No | The paper mentions software such as 'WaveGlow' and the 'ADAM optimizer' but does not provide specific version numbers for these or other key software components, which are needed for reproducibility. |
| Experiment Setup | Yes | We use the ADAM (Kingma & Ba, 2014) optimizer with default parameters, 1e-4 learning rate and 1e-6 weight decay for Flowtron and 1e-3 learning rate and 1e-5 weight decay for the other models, following Wang et al. (2017). Flowtron models with 2 steps of flow were trained on the LSH dataset for approximately 1000 epochs, then fine-tuned on LibriTTS for 500 epochs. Tacotron 2 and Tacotron 2 GST are trained for approximately 500 epochs. (The optimizer settings are sketched after the table.) |
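
The "Pseudocode" row points at Algorithm 1 (posterior inference). Below is a minimal, hedged PyTorch sketch of the idea: run the forward flow on evidence mel-spectrograms to obtain latent z values, shift the sampling distribution toward their mean, and decode with the inverse flow. The method names `model.forward` and `model.infer`, the `prior_weight` interpolation, and all tensor shapes are assumptions for illustration, not the repository's actual API.

```python
# Hedged sketch of posterior inference in the spirit of Algorithm 1.
# `model.forward(...)` and `model.infer(...)` are hypothetical method names;
# the real interface lives in https://github.com/NVIDIA/flowtron.
import torch

def posterior_inference_sketch(model, evidence_mels, evidence_text, evidence_speakers,
                               new_text, new_speaker, sigma=0.7, prior_weight=0.5):
    """Bias generation toward the style carried by a set of evidence samples."""
    with torch.no_grad():
        # 1. Map each evidence mel-spectrogram to latent space via the forward flow.
        zs = [model.forward(mel, text, spk)              # hypothetical call
              for mel, text, spk in zip(evidence_mels, evidence_text, evidence_speakers)]
        # Flatten time so every frame contributes one latent vector.
        z_evidence = torch.cat([z.reshape(-1, z.size(-1)) for z in zs], dim=0)

        # 2. Interpolate between the zero-mean Gaussian prior and the evidence mean
        #    (a simple stand-in for the exact posterior update in Algorithm 1).
        posterior_mean = (1.0 - prior_weight) * z_evidence.mean(dim=0)

        # 3. Sample around the posterior mean and decode with the inverse flow.
        #    A real implementation would tile or re-sample this per output frame.
        z_sample = posterior_mean + sigma * torch.randn_like(posterior_mean)
        mel_out = model.infer(z_sample, new_text, new_speaker)  # hypothetical call
    return mel_out
```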
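
For the "Dataset Splits" row, a small sketch of the described hold-out: at least 180 samples per dataset go to validation and the remainder to training. The file-list names and the `path|text|speaker_id` layout are assumptions, not part of the paper.

```python
# Hedged sketch of the per-dataset hold-out described in the splits row.
import random

def split_filelist(lines, num_val=180, seed=1234):
    """Shuffle a dataset's file list and hold out `num_val` samples for validation."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    return lines[num_val:], lines[:num_val]   # (train, validation)

# Example: combine LJSpeech with the two proprietary sets into the LSH split
# (file names below are hypothetical).
# train_lsh, val_lsh = [], []
# for path in ["ljs_filelist.txt", "sally_filelist.txt", "helen_filelist.txt"]:
#     with open(path) as f:
#         tr, va = split_filelist(f.read().splitlines())
#     train_lsh += tr
#     val_lsh += va
```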
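
The "Experiment Setup" row quotes the optimizer hyperparameters directly; the following is a minimal PyTorch sketch of those settings, assuming `flowtron` and `tacotron2` are already-constructed models (hypothetical variable names) and leaving ADAM's remaining parameters at their defaults, as stated in the paper.

```python
# Hedged sketch of the quoted optimizer configuration.
import torch

flowtron_optimizer = torch.optim.Adam(
    flowtron.parameters(), lr=1e-4, weight_decay=1e-6)   # Flowtron settings

baseline_optimizer = torch.optim.Adam(
    tacotron2.parameters(), lr=1e-3, weight_decay=1e-5)  # Tacotron 2 / GST settings
```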