Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Authors: Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the proposed methods, we conduct experiments on two different datasets. For the single speaker setting, a single female speaker dataset, LJSpeech [8], is used, which consists of 13,100 short audio clips with a total duration of approximately 24 hours. We randomly split the dataset into the training set (12,500 samples), validation set (100 samples), and test set (500 samples). For the multi-speaker setting, the train-clean-100 subset of the LibriTTS corpus [33] is used, which consists of audio recordings of 247 speakers with a total duration of about 54 hours. We compare Glow-TTS with the best publicly available autoregressive TTS model, Tacotron 2 [26]. We measure the mean opinion score (MOS) via Amazon Mechanical Turk to compare the quality of all audio clips, including ground truth (GT) and the synthesized samples; 50 sentences are randomly chosen from the test set for the evaluation.
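
For reference, MOS is the mean of 1-to-5 listener ratings, typically reported with a 95% confidence interval. A minimal sketch; the `ratings` values and the pure-Python statistics are illustrative, and rating collection on Amazon Mechanical Turk is out of scope:

```python
import math

def mean_opinion_score(ratings):
    """Mean of 1-5 listener ratings with a normal-approximation 95% CI."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    return mos, 1.96 * math.sqrt(var / n)

# e.g. mean_opinion_score([4, 5, 4, 3, 5, 4]) -> (4.17, ~0.60)
```
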
Researcher Affiliation | Collaboration | Jaehyeon Kim (Kakao Enterprise, jay.xyz@kakaoenterprise.com); Sungwon Kim (Data Science & AI Lab., Seoul National University, ksw0306@snu.ac.kr); Jungil Kong (Kakao Enterprise, henry.k@kakaoenterprise.com); Sungroh Yoon (Data Science & AI Lab., Seoul National University, sryoon@snu.ac.kr)
Pseudocode | Yes | We present our alignment search algorithm in Algorithm 1.
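
To make the quoted pseudocode concrete, here is a minimal NumPy sketch of monotonic alignment search as described in the paper: a Viterbi-style dynamic program over a (text tokens × mel frames) log-likelihood matrix, followed by backtracking. The function and variable names (`monotonic_alignment_search`, `log_p`) are illustrative assumptions, not the official repo's API:

```python
import numpy as np

def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """Most probable monotonic alignment between text tokens and mel frames.

    log_p: (T_text, T_mel) log-likelihood of each mel frame under each
           text token's prior distribution.
    Returns a (T_mel,) array mapping each mel frame to a text-token index.
    """
    T_text, T_mel = log_p.shape
    # Q[i, j]: best cumulative log-likelihood over frames 0..j for an
    # alignment that ends on token i at frame j.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # token index can't exceed frame index
            stay = Q[i, j - 1]                               # keep the same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move to the next token
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the last token at the last frame.
    alignment = np.empty(T_mel, dtype=np.int64)
    i = T_text - 1
    alignment[T_mel - 1] = i
    for j in range(T_mel - 1, 0, -1):
        if i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
        alignment[j - 1] = i
    return alignment
```

The O(T_text × T_mel) recurrence is the same in the released code, which compiles it for speed; this loop version is only meant to show the logic.
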
Open Source Code | Yes | Our source code [2] and synthesized audio samples [3] are publicly available. [Footnote 2: https://github.com/jaywalnut310/glow-tts]
Open Datasets | Yes | For the single speaker setting, a single female speaker dataset, LJSpeech [8], is used... For the multi-speaker setting, the train-clean-100 subset of the LibriTTS corpus [33] is used...
Dataset Splits | Yes | We randomly split the dataset into the training set (12,500 samples), validation set (100 samples), and test set (500 samples). We then split it [the LibriTTS subset] into the training (29,181 samples), validation (88 samples), and test sets (442 samples).
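
A minimal sketch of the LJSpeech split quoted above (12,500 / 100 / 500 out of 13,100 clips); the seed and file-list handling are assumptions, not the authors' published split:

```python
import random

def split_filelist(files, n_train=12500, n_val=100, n_test=500, seed=1234):
    """Randomly partition a list of clip paths into train/val/test."""
    assert len(files) == n_train + n_val + n_test  # 13,100 LJSpeech clips
    files = list(files)
    random.Random(seed).shuffle(files)  # shuffle a copy, deterministically
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test
```
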
Hardware Specification | Yes | This required only 3 days with mixed precision training on two NVIDIA V100 GPUs. All multi-speaker models were trained for 960K iterations on four NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions the Adam optimizer and the WaveGlow vocoder, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | During training, we simply set the standard deviation σ of the learnable prior to be a constant 1. Glow-TTS was trained for 240K iterations using the Adam optimizer [11] with the Noam learning rate schedule [31]. We compare Glow-TTS with the best publicly available autoregressive TTS model, Tacotron 2 [26]. For all the experiments, phonemes are chosen as input text tokens. We follow the configuration for the mel-spectrogram of [27]... To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension. The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning [29]. The rest of the settings are the same as for the single speaker setting.
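
The Noam schedule [31] referenced in the quote is the warmup-then-inverse-square-root rule from the Transformer paper. A sketch, where the d_model and warmup_steps values are illustrative assumptions (the excerpt does not quote them):

```python
def noam_lr(step: int, d_model: int = 192, warmup_steps: int = 4000) -> float:
    """lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5).

    Linear warmup for `warmup_steps` steps, then inverse-square-root decay.
    """
    step = max(step, 1)  # avoid 0**-0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```
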