Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
Authors: Rafael Valle, Kevin J. Shih, Ryan Prenger, Bryan Catanzaro
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section describes our training setup and provides quantitative and qualitative results. Our quantitative results show that Flowtron has mean opinion scores that are comparable to the state of the art. Our qualitative results demonstrate many features that are either impossible or inefficient to achieve using Tacotron, Tacotron 2 GST and Tacotron GM-VAE. |
| Researcher Affiliation | Industry | Rafael Valle, Kevin J. Shih, Ryan Prenger & Bryan Catanzaro NVIDIA rafaelvalle@nvidia.com |
| Pseudocode | Yes | Algorithm 1: Flowtron Posterior inference (a hedged sketch follows the table) |
| Open Source Code | Yes | Code and pre-trained models are publicly available at https://github.com/NVIDIA/flowtron. |
| Open Datasets | Yes | We train Flowtron, Tacotron 2 and Tacotron 2 GST models using a dataset (LSH) that combines the LJSpeech dataset (Ito et al., 2017) with two proprietary single speaker datasets with 20 and 10 hours each (Sally and Helen). We also train a Flowtron model on the train-clean-100 subset of LibriTTS (Zen et al., 2019) with 123 speakers and 25 minutes on average per speaker. |
| Dataset Splits | Yes | For each dataset, we use at least 180 samples for the validation set, and the remainder for the training set. (A sketch of this split follows the table.) |
| Hardware Specification | Yes | Each model is trained on a single NVIDIA DGX-1 with 8 GPUs. |
| Software Dependencies | No | The paper mentions software such as 'WaveGlow' and the 'ADAM optimizer' but does not provide specific version numbers for these or other key software components, which are needed for reproducibility. |
| Experiment Setup | Yes | We use the ADAM (Kingma & Ba, 2014) optimizer with default parameters, 1e-4 learning rate and 1e-6 weight decay for Flowtron and 1e-3 learning rate and 1e-5 weight decay for the other models, following Wang et al. (2017). Flowtron models with 2 steps of flow were trained on the LSH dataset for approximately 1000 epochs, then fine-tuned on LibriTTS for 500 epochs. Tacotron 2 and Tacotron 2 GST are trained for approximately 500 epochs. (The optimizer settings are sketched after the table.) |
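
The "Pseudocode" row points at Algorithm 1 (posterior inference). Below is a minimal, hedged PyTorch sketch of the idea: run the forward flow on evidence mel-spectrograms to obtain latent z values, shift the sampling distribution toward their mean, and decode with the inverse flow. The method names `model.forward` and `model.infer`, the `prior_weight` interpolation, and all tensor shapes are assumptions for illustration, not the repository's actual API.

```python
# Hedged sketch of posterior inference in the spirit of Algorithm 1.
# `model.forward(...)` and `model.infer(...)` are hypothetical method names;
# the real interface lives in https://github.com/NVIDIA/flowtron.
import torch

def posterior_inference_sketch(model, evidence_mels, evidence_text, evidence_speakers,
                               new_text, new_speaker, sigma=0.7, prior_weight=0.5):
    """Bias generation toward the style carried by a set of evidence samples."""
    with torch.no_grad():
        # 1. Map each evidence mel-spectrogram to latent space via the forward flow.
        zs = [model.forward(mel, text, spk)              # hypothetical call
              for mel, text, spk in zip(evidence_mels, evidence_text, evidence_speakers)]
        # Flatten time so every frame contributes one latent vector.
        z_evidence = torch.cat([z.reshape(-1, z.size(-1)) for z in zs], dim=0)

        # 2. Interpolate between the zero-mean Gaussian prior and the evidence mean
        #    (a simple stand-in for the exact posterior update in Algorithm 1).
        posterior_mean = (1.0 - prior_weight) * z_evidence.mean(dim=0)

        # 3. Sample around the posterior mean and decode with the inverse flow.
        #    A real implementation would tile or re-sample this per output frame.
        z_sample = posterior_mean + sigma * torch.randn_like(posterior_mean)
        mel_out = model.infer(z_sample, new_text, new_speaker)  # hypothetical call
    return mel_out
```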
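
For the "Dataset Splits" row, a small sketch of the described hold-out: at least 180 samples per dataset go to validation and the remainder to training. The file-list names and the `path|text|speaker_id` layout are assumptions, not part of the paper.

```python
# Hedged sketch of the per-dataset hold-out described in the splits row.
import random

def split_filelist(lines, num_val=180, seed=1234):
    """Shuffle a dataset's file list and hold out `num_val` samples for validation."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    return lines[num_val:], lines[:num_val]   # (train, validation)

# Example: combine LJSpeech with the two proprietary sets into the LSH split
# (file names below are hypothetical).
# train_lsh, val_lsh = [], []
# for path in ["ljs_filelist.txt", "sally_filelist.txt", "helen_filelist.txt"]:
#     with open(path) as f:
#         tr, va = split_filelist(f.read().splitlines())
#     train_lsh += tr
#     val_lsh += va
```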
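
The "Experiment Setup" row quotes the optimizer hyperparameters directly; the following is a minimal PyTorch sketch of those settings, assuming `flowtron` and `tacotron2` are already-constructed models (hypothetical variable names) and leaving ADAM's remaining parameters at their defaults, as stated in the paper.

```python
# Hedged sketch of the quoted optimizer configuration.
import torch

flowtron_optimizer = torch.optim.Adam(
    flowtron.parameters(), lr=1e-4, weight_decay=1e-6)   # Flowtron settings

baseline_optimizer = torch.optim.Adam(
    tacotron2.parameters(), lr=1e-3, weight_decay=1e-5)  # Tacotron 2 / GST settings
```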