Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Authors: Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ö. Arık, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on a single GPU server. (A hedged sketch of the paper's convolutional building block appears after the table.) |
| Researcher Affiliation | Collaboration | Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ö. Arık, Ajay Kannan, Sharan Narang (Baidu Research) {pingwei01, pengkainan, gibianskyandrew, sercanarik, kannanajay, sharan}@baidu.com; Jonathan Raiman (OpenAI) raiman@openai.com; John Miller (University of California, Berkeley) miller_john@berkeley.edu |
| Pseudocode | No | The paper describes the model architecture and components in detail with text and diagrams (Fig. 1, Fig. 2, Fig. 3, Fig. 5, Fig. 6) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions implementation details and custom GPU kernels in Appendix B, but there is no explicit statement or link indicating that the code for the described methodology is open-source or publicly available. |
| Open Datasets | Yes | Data: For single-speaker synthesis, we use an internal English speech dataset containing approximately 20 hours of audio with a sample rate of 48 kHz. For multi-speaker synthesis, we use the VCTK (Yamagishi et al., 2009) and LibriSpeech (Panayotov et al., 2015) datasets. The VCTK dataset consists of audio for 108 speakers, with a total duration of 44 hours. The LibriSpeech dataset consists of audio for 2484 speakers, with a total duration of 820 hours. |
| Dataset Splits | No | The paper mentions the datasets used and a 100-sentence test set, but it does not provide explicit details on the training, validation, or test splits (e.g., percentages or exact counts) for the main datasets (VCTK, LibriSpeech). |
| Hardware Specification | Yes | On a single Nvidia Tesla P100 GPU with 56 SMs, we achieve an inference speed of 115 QPS, which corresponds to our target ten million queries per day. We parallelize WORLD synthesis across all 20 CPUs on the server, permanently pinning threads to CPUs in order to maximize cache performance. (A quick arithmetic check of the throughput figure follows the table.) |
| Software Dependencies | No | The paper mentions software such as TensorFlow, CUDA, and the SoX audio tool, but does not provide version numbers for these or any other key software dependencies required for replication. |
| Experiment Setup | Yes | All hyperparameters of the models used in this paper are shown in Table 4, reproduced below. |

Table 4: Hyperparameters.

| Parameter | Single-Speaker | VCTK | LibriSpeech |
|---|---|---|---|
| FFT Size | 4096 | 4096 | 4096 |
| FFT Window Size / Shift | 2400 / 600 | 2400 / 600 | 1600 / 400 |
| Audio Sample Rate | 48000 | 48000 | 16000 |
| Reduction Factor r | 4 | 4 | 4 |
| Mel Bands | 80 | 80 | 80 |
| Sharpening Factor | 1.4 | 1.4 | 1.4 |
| Character Embedding Dim. | 256 | 256 | 256 |
| Encoder Layers / Conv. Width / Channels | 7 / 5 / 64 | 7 / 5 / 128 | 7 / 5 / 256 |
| Decoder Affine Size | 128, 256 | 128, 256 | 128, 256 |
| Decoder Layers / Conv. Width | 4 / 5 | 6 / 5 | 8 / 5 |
| Attention Hidden Size | 128 | 256 | 256 |
| Position Weight / Initial Rate | 1.0 / 6.3 | 0.1 / 7.6 | 0.1 / 2.6 |
| Converter Layers / Conv. Width / Channels | 5 / 5 / 256 | 6 / 5 / 256 | 8 / 5 / 256 |
| Dropout Keep Probability | 0.95 | 0.95 | 0.99 |
| Number of Speakers | 1 | 108 | 2484 |
| Speaker Embedding Dim. | — | 16 | 512 |
| ADAM Learning Rate | 0.001 | 0.0005 | 0.0005 |
| Anneal Rate / Anneal Interval | — | 0.98 / 30000 | 0.95 / 30000 |
| Batch Size | 16 | 16 | 16 |
| Max Gradient Norm | 100 | 100 | 50.0 |
| Gradient Clipping Max. Value | 5 | 5 | 5 |
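To make Table 4 easier to consume programmatically, here is a minimal Python sketch encoding the single-speaker column as a plain configuration dict. The key names are hypothetical (ours, not the paper's); only the values come from Table 4.

```python
# Hypothetical config sketch: the single-speaker column of Table 4 as a dict.
# Key names are our own; values are copied verbatim from the paper's Table 4.
SINGLE_SPEAKER_HPARAMS = {
    "fft_size": 4096,
    "fft_window_size": 2400,
    "fft_window_shift": 600,
    "audio_sample_rate": 48000,
    "reduction_factor": 4,          # r: spectrogram frames predicted per decoder step
    "mel_bands": 80,
    "sharpening_factor": 1.4,
    "char_embedding_dim": 256,
    "encoder": {"layers": 7, "conv_width": 5, "channels": 64},
    "decoder_affine_sizes": (128, 256),
    "decoder": {"layers": 4, "conv_width": 5},
    "attention_hidden_size": 128,
    "position_weight": 1.0,
    "position_initial_rate": 6.3,
    "converter": {"layers": 5, "conv_width": 5, "channels": 256},
    "dropout_keep_prob": 0.95,
    "num_speakers": 1,
    "adam_learning_rate": 1e-3,
    "batch_size": 16,
    "max_gradient_norm": 100,
    "gradient_clip_max_value": 5,
}

print(SINGLE_SPEAKER_HPARAMS["encoder"])  # {'layers': 7, 'conv_width': 5, 'channels': 64}
```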
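The abstract's "fully-convolutional" description refers to the 1-D convolution blocks from which the paper builds its encoder, decoder, and converter: each block applies a convolution with a gated-linear-unit (GLU) output and a residual connection scaled by √0.5. The PyTorch sketch below is our own illustration under those assumptions, not the authors' (unreleased) implementation; the dropout placement in particular is an assumption.

```python
# Minimal sketch (ours, not the authors' code) of a Deep Voice 3-style
# convolution block: 1-D conv -> gated linear unit -> sqrt(0.5)-scaled residual.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    def __init__(self, channels: int, width: int = 5, dropout: float = 0.05):
        super().__init__()
        # 2x output channels: one half is the value, the other half the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=width,
                              padding=width // 2)
        # Table 4's dropout keep probability 0.95 implies a dropout rate of 0.05.
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        h = self.conv(self.dropout(x))
        h = F.glu(h, dim=1)              # gated linear unit over the channel halves
        return (x + h) * math.sqrt(0.5)  # residual, scaled to preserve variance

x = torch.randn(16, 64, 100)             # batch 16, 64 channels, 100 timesteps
print(ConvBlock(64)(x).shape)            # torch.Size([16, 64, 100])
```

Stacking 7 such blocks with 64 channels and width 5 matches the single-speaker encoder row of Table 4 (7 / 5 / 64).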
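As a quick sanity check on the hardware row, the reported 115 QPS is indeed consistent with the paper's ten-million-queries-per-day target:

```python
# Throughput arithmetic for the reported 115 queries/second on one Tesla P100.
qps = 115
seconds_per_day = 24 * 60 * 60                   # 86,400 seconds
print(f"{qps * seconds_per_day:,} queries/day")  # 9,936,000 -- roughly ten million
```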