EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
Authors: Chenfeng Miao, Shuang Liang, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 (Shen et al., 2018) and Glow-TTS (Kim et al., 2020) in terms of speech quality, training efficiency and synthesis speed, while still producing the speeches of strong robustness and great diversity. |
| Researcher Affiliation | Industry | Ping An Technology. Correspondence to: Chenfeng Miao <miao_chenfeng@126.com>. |
| Pseudocode | Yes | We show the implementation of each components in the following subsections and more details including the pseudocode in Appendix B. |
| Open Source Code | No | Audio samples of the proposed models are available at: https://mcf330.github.io/EfficientTTSAudioSamples/. No explicit statement about releasing the source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conduct most of our experiments on an open-source standard Mandarin dataset from Data Baker (https://www.data-baker.com/open_source.html), which consists of 10,000 Chinese clips from a single female speaker with a sampling rate of 22.05 kHz. ... We also conduct some experiments using LJ-Speech dataset (Ito, 2017), which is a 24-hour waveform audio set of a single female speaker with 13,100 audio clips and a sample rate of 22.05 kHz. |
| Dataset Splits | No | The paper mentions using Data Baker and LJ-Speech datasets, but does not explicitly state the proportions or counts for training, validation, and test splits. It only gives the total size of the datasets (e.g., "10,000 Chinese clips", "24-hour waveform audio set"). |
| Hardware Specification | Yes | We run training and inference on a single V100 GPU. |
| Software Dependencies | No | The paper mentions using the "HiFi-GAN (Kong et al., 2020) vocoder", "open-source implementations of Tacotron 2", and "Glow-TTS", but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The temperature of latent variable z is set to 0.667 for both Glow-TTS and EFTS-Flow. ... η is a hyper-parameter which we set to 1.2 for all experiments. |
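
For readers attempting a reproduction, the settings quoted in the table can be collected into a single configuration. The snippet below is a minimal sketch, not the authors' code: the variable names (`sampling_temperature`, `eta`, `sample_rate`) are illustrative assumptions; only the values (0.667, 1.2, 22.05 kHz) come from the paper's reported setup.

```python
# Minimal sketch of the hyper-parameters reported for EfficientTTS reproduction.
# Variable names are assumptions; values are taken from the paper's experiment setup.

config = {
    # Temperature applied to the latent variable z at inference time,
    # set to 0.667 for both Glow-TTS and EFTS-Flow in the paper's comparison.
    "sampling_temperature": 0.667,
    # The eta hyper-parameter the authors fix to 1.2 for all experiments.
    "eta": 1.2,
    # Audio sampling rate quoted for both the Data Baker and LJ-Speech datasets.
    "sample_rate": 22050,
}

if __name__ == "__main__":
    # Print the collected settings for a quick sanity check before training.
    for key, value in config.items():
        print(f"{key}: {value}")
```

Note that the paper reports training and inference on a single V100 GPU; other details such as dataset splits and software versions are not stated, so any reproduction has to choose those independently.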