Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Authors: Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the proposed methods, we conduct experiments on two different datasets. For the single speaker setting, a single female speaker dataset, LJSpeech [8], is used, which consists of 13,100 short audio clips with a total duration of approximately 24 hours. We randomly split the dataset into the training set (12,500 samples), validation set (100 samples), and test set (500 samples). For the multi-speaker setting, the train-clean-100 subset of the LibriTTS corpus [33] is used, which consists of audio recordings of 247 speakers with a total duration of about 54 hours. We compare Glow-TTS with the best publicly available autoregressive TTS model, Tacotron 2 [26]. We measure the mean opinion score (MOS) via Amazon Mechanical Turk to compare the quality of all audio clips, including ground truth (GT), and the synthesized samples; 50 sentences are randomly chosen from the test set for the evaluation. |
| Researcher Affiliation | Collaboration | Jaehyeon Kim (Kakao Enterprise, jay.xyz@kakaoenterprise.com); Sungwon Kim (Data Science & AI Lab., Seoul National University, ksw0306@snu.ac.kr); Jungil Kong (Kakao Enterprise, henry.k@kakaoenterprise.com); Sungroh Yoon (Data Science & AI Lab., Seoul National University, sryoon@snu.ac.kr) |
| Pseudocode | Yes | We present our alignment search algorithm in Algorithm 1. [An illustrative sketch of this search appears below the table.] |
| Open Source Code | Yes | Our source code and synthesized audio samples are publicly available. [Footnote 2: https://github.com/jaywalnut310/glow-tts] |
| Open Datasets | Yes | For the single speaker setting, a single female speaker dataset, LJSpeech [8], is used... For the multi-speaker setting, the train-clean-100 subset of the LibriTTS corpus [33] is used... |
| Dataset Splits | Yes | [LJSpeech:] We randomly split the dataset into the training set (12,500 samples), validation set (100 samples), and test set (500 samples). [LibriTTS:] We then split it into the training (29,181 samples), validation (88 samples), and test sets (442 samples). [A split sketch appears below the table.] |
| Hardware Specification | Yes | This required only 3 days with mixed precision training on two NVIDIA V100 GPUs. All multi-speaker models were trained for 960K iterations on four NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and the WaveGlow vocoder, but does not specify their version numbers or any other software dependencies with versions. |
| Experiment Setup | Yes | During training, we simply set the standard deviation σ of the learnable prior to be a constant 1. Glow-TTS was trained for 240K iterations using the Adam optimizer [11] with the Noam learning rate schedule [31]. We compare Glow-TTS with the best publicly available autoregressive TTS model, Tacotron 2 [26]. For all the experiments, phonemes are chosen as input text tokens. We follow the configuration for the mel-spectrogram of [27]... To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension. The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning [29]. The rest of the settings are the same as for the single speaker setting. [An optimizer/schedule sketch appears below the table.] |
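
The pseudocode row refers to the paper's monotonic alignment search (Algorithm 1), a dynamic program that finds the most probable monotonic, non-skipping alignment between text tokens and mel frames. The sketch below is an illustrative NumPy reconstruction, not the authors' released implementation; the function name and shape conventions are our assumptions.

```python
import numpy as np

def monotonic_alignment_search(log_likelihood):
    """Illustrative sketch of a monotonic alignment search.

    log_likelihood: (T_text, T_mel) array; entry (i, j) is the
    log-likelihood of mel frame j under the prior of text token i.
    Assumes T_mel >= T_text so every token can claim at least one frame.
    Returns a binary (T_text, T_mel) alignment matrix.
    """
    T_text, T_mel = log_likelihood.shape
    Q = np.full((T_text, T_mel), -np.inf)  # best cumulative log-likelihood
    Q[0, 0] = log_likelihood[0, 0]

    # Forward pass: per frame, the token index can stay the same or
    # advance by one, so Q[i, j] = max(Q[i, j-1], Q[i-1, j-1]) + L[i, j].
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_likelihood[i, j]

    # Backtracking: recover the argmax path from the final cell.
    alignment = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        alignment[i, j] = 1
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment
```

During training, the paper alternates this search (to pick the most probable alignment under the current prior) with gradient updates of the model; the quadratic-time inner loop above is the part that efficient implementations move out of pure Python.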
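
For concreteness, here is a minimal sketch that reproduces the reported LJSpeech split sizes (12,500 / 100 / 500 out of 13,100 clips). The paper does not state a seed or the actual split assignment, so both are assumptions here.

```python
import random

# LJSpeech ships a metadata.csv with one line per audio clip.
with open("metadata.csv", encoding="utf-8") as f:
    samples = f.read().splitlines()
assert len(samples) == 13100

random.Random(0).shuffle(samples)  # seed 0 is an arbitrary choice
train = samples[:12500]
valid = samples[12500:12600]
test = samples[12600:]
```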
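
The setup row names the Adam optimizer with the Noam learning rate schedule [31]. A minimal PyTorch sketch of that combination follows; the warmup length, model width, and the stand-in module are assumptions, since the excerpt does not state them.

```python
import torch

d_model, warmup_steps = 192, 4000          # assumed values, not from the paper
model = torch.nn.Linear(d_model, d_model)  # stand-in for the Glow-TTS network

# With base lr = 1.0, LambdaLR makes the lambda below the effective rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)

def noam_lr(step: int) -> float:
    # Noam schedule: linear warmup, then inverse-square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(10):          # training loop skeleton
    optimizer.step()            # ... after computing gradients
    scheduler.step()            # advance the schedule once per iteration
```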