EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Authors: Chenfeng Miao, Shuang Liang, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 (Shen et al., 2018) and Glow-TTS (Kim et al., 2020) in terms of speech quality, training efficiency and synthesis speed, while still producing speeches of strong robustness and great diversity.
Researcher Affiliation | Industry | Ping An Technology. Correspondence to: Chenfeng Miao <miao chenfeng@126.com>.
Pseudocode | Yes | We show the implementation of each component in the following subsections and more details, including the pseudocode, in Appendix B.
Open Source Code | No | Audio samples of the proposed models are available at: https://mcf330.github.io/EfficientTTSAudioSamples/. No explicit statement about releasing the source code for the methodology or a link to a code repository.
Open Datasets | Yes | We conduct most of our experiments on an open-source standard Mandarin dataset from Data Baker (https://www.data-baker.com/open_source.html), which consists of 10,000 Chinese clips from a single female speaker with a sampling rate of 22.05 kHz. ... We also conduct some experiments using the LJ-Speech dataset (Ito, 2017), which is a 24-hour waveform audio set of a single female speaker with 13,100 audio clips and a sample rate of 22.05 kHz. (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions using the Data Baker and LJ-Speech datasets, but does not explicitly state the proportions or counts for training, validation, and test splits. It only gives the total size of each dataset (e.g., "10,000 Chinese clips", "24-hour waveform audio set").
Hardware Specification | Yes | We run training and inference on a single V100 GPU.
Software Dependencies | No | The paper mentions using the "HiFi-GAN (Kong et al., 2020) vocoder", "open-source implementations of Tacotron 2", and "Glow-TTS", but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The temperature of latent variable z is set to 0.667 for both Glow-TTS and EFTS-Flow. ... η is a hyper-parameter which we set to 1.2 for all experiments. (A sampling sketch follows the table.)
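
The dataset row above quotes a 22.05 kHz sampling rate for both corpora. As a minimal, hypothetical sketch (not taken from the paper, which does not name an audio library), loading a clip at that rate with librosa could look like this:

```python
# Hypothetical loading helper; only the 22.05 kHz rate is taken from the paper.
import librosa

TARGET_SR = 22050  # 22.05 kHz, as reported in the dataset description

def load_clip(path: str):
    """Load an audio file and resample it to the target rate on load."""
    wav, sr = librosa.load(path, sr=TARGET_SR)
    return wav, sr

if __name__ == "__main__":
    # Example LJ-Speech filename; any wav path works.
    wav, sr = load_clip("LJ001-0001.wav")
    print(f"{len(wav)} samples at {sr} Hz")
```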
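The experiment-setup row quotes a latent temperature of 0.667 for Glow-TTS and EFTS-Flow. The sketch below illustrates what temperature scaling of the prior usually means in flow-based TTS inference; it is an illustration under that assumption, not the authors' code, and the decoder call is a hypothetical stand-in.

```python
# Temperature-scaled prior sampling, as commonly used in flow-based TTS.
# Only the temperature value (0.667) comes from the paper's reported setup.
import torch

TEMPERATURE = 0.667

def sample_latent(shape, temperature=TEMPERATURE):
    """Draw z ~ N(0, temperature^2 I); a lower temperature trades diversity for stability."""
    return torch.randn(shape) * temperature

# Usage sketch (hypothetical decoder interface):
# z = sample_latent((1, 80, mel_length))
# mel = decoder.inverse(z, text_encoding)
```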