Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Authors: Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2... We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly. Section 5.1 quantifies the improvement for single speaker TTS through a mean opinion score (MOS) evaluation and Section 5.2 presents the synthesized audio quality of multi-speaker Deep Voice 2 and Tacotron via both MOS evaluation and a multi-speaker discriminator accuracy metric. (Illustrative sketches of the speaker-embedding conditioning and of these evaluation metrics appear after the table.) |
| Researcher Affiliation | Industry | Sercan Ö. Arık sercanarik@baidu.com Gregory Diamos gregdiamos@baidu.com Andrew Gibiansky gibianskyandrew@baidu.com John Miller millerjohn@baidu.com Kainan Peng pengkainan@baidu.com Wei Ping pingwei01@baidu.com Jonathan Raiman jonathanraiman@baidu.com Yanqi Zhou zhouyanqi@baidu.com Baidu Silicon Valley Artificial Intelligence Lab 1195 Bordeaux Dr. Sunnyvale, CA 94089 |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1, Figure 2, Figure 3) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it explicitly state that the code is open-source or available. |
| Open Datasets | Yes | We train Deep Voice 1, Deep Voice 2, and Tacotron on an internal English speech database containing approximately 20 hours of single-speaker data... We train all the aforementioned models on the VCTK dataset with 44 hours of speech, which contains 108 speakers with approximately 400 utterances each. |
| Dataset Splits | No | The paper mentions training models on datasets and evaluating them using MOS, but it does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and algorithms but does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed for replication. |
| Experiment Setup | Yes | All model hyperparameters are presented in Appendix B. |
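
The abstract quoted in the Research Type row centers on conditioning a single TTS model on low-dimensional trainable speaker embeddings. The following is a minimal PyTorch-style sketch of that idea; it is not the paper's implementation, and the layer sizes, the projection, and the injection site (initializing a recurrent state) are assumptions chosen for illustration rather than values from the paper's Appendix B.

```python
# Illustrative sketch (not the paper's code): a single model with one trainable
# embedding vector per speaker; all other weights are shared across speakers.
import torch
import torch.nn as nn


class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, num_speakers: int, speaker_dim: int = 16,
                 input_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Low-dimensional trainable speaker embedding table.
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)
        # Project the embedding to the recurrent layer's size so it can
        # initialize the hidden state (one plausible injection site; the
        # paper's exact sites are module-specific).
        self.to_initial_state = nn.Linear(speaker_dim, hidden_dim)
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor, speaker_ids: torch.Tensor):
        # frames: (batch, time, input_dim); speaker_ids: (batch,)
        emb = self.speaker_embedding(speaker_ids)        # (batch, speaker_dim)
        h0 = torch.tanh(self.to_initial_state(emb))      # (batch, hidden_dim)
        out, _ = self.rnn(frames, h0.unsqueeze(0))       # (batch, time, hidden_dim)
        return out


if __name__ == "__main__":
    encoder = SpeakerConditionedEncoder(num_speakers=108)  # e.g. VCTK's 108 speakers
    x = torch.randn(4, 50, 256)                            # dummy input frames
    ids = torch.tensor([0, 1, 2, 3])                        # one speaker id per utterance
    print(encoder(x, ids).shape)                            # torch.Size([4, 50, 256])
```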
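
The same row names the paper's two evaluation measures: MOS and a multi-speaker discriminator accuracy. The snippet below sketches how such quantities are typically computed, i.e. an average of listener ratings and the fraction of synthesized clips that a speaker classifier trained on real audio assigns to the intended speaker; the function names and toy numbers are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of the two metrics; not the paper's evaluation code.
from typing import Callable, Sequence


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """Average of crowd-sourced naturalness ratings on a 1-5 scale."""
    return sum(ratings) / len(ratings)


def discriminator_accuracy(
    classify: Callable[[object], int],   # pretrained speaker classifier
    synthesized: Sequence[object],       # synthesized audio clips
    target_speakers: Sequence[int],      # intended speaker id per clip
) -> float:
    """Fraction of clips classified as the speaker they were meant to imitate."""
    correct = sum(
        classify(clip) == speaker
        for clip, speaker in zip(synthesized, target_speakers)
    )
    return correct / len(synthesized)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    print(mean_opinion_score([4.0, 3.5, 4.5, 4.0]))                     # 4.0
    dummy_classifier = lambda clip: clip["speaker_guess"]
    clips = [{"speaker_guess": 0}, {"speaker_guess": 1}, {"speaker_guess": 1}]
    print(discriminator_accuracy(dummy_classifier, clips, [0, 1, 2]))   # ~0.667
```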