Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture
Authors: Chenfeng Miao, Shuang Liang, Zhengchen Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 (Shen et al., 2018) and Glow-TTS (Kim et al., 2020) in terms of speech quality, training efficiency and synthesis speed, while still producing the speeches of strong robustness and great diversity. |
| Researcher Affiliation | Industry | 1Ping An Technology. Correspondence to: Chenfeng Miao <miao EMAIL>. |
| Pseudocode | Yes | We show the implementation of each components in the following subsections and more details including the pseudocode in Appendix B. |
| Open Source Code | No | Audio samples of the proposed models are available at: https://mcf330.github.io/EfficientTTSAudioSamples/. No explicit statement about releasing the source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conduct most of our experiments on an open-source standard Mandarin dataset from Data Baker2, which consists of 10,000 Chinese clips from a single female speaker with a sampling rate of 22.05 kHz. ... We also conduct some experiments using LJ-Speech dataset (Ito, 2017), which is a 24-hour waveform audio set of a single female speaker with 13,100 audio clips and a sample rate of 22.05 kHz. 2https://www.data-baker.com/open_source.html |
| Dataset Splits | No | The paper mentions using Data Baker and LJ-Speech datasets, but does not explicitly state the proportions or counts for training, validation, and test splits. It only gives the total size of the datasets (e.g., "10,000 Chinese clips", "24-hour waveform audio set"). |
| Hardware Specification | Yes | We run training and inference on a single V100 GPU. |
| Software Dependencies | No | The paper mentions using the "HiFi-GAN (Kong et al., 2020) vocoder", "open-source implementations of Tacotron 2", and "Glow-TTS", but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The temperature of latent variable z is set to 0.667 for both Glow-TTS and EFTS-Flow. ... η is a hyper-parameter which we set to 1.2 for all experiments. |