Deep Voice: Real-time Neural Text-to-Speech
Authors: Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. After approximately 14,000 iterations, the model converges to a phoneme pair error rate of 7%. We also find that phoneme boundaries do not have to be precise, and randomly shifting phoneme boundaries by 10-30 milliseconds makes no difference in the audio quality, and so suspect that audio quality is insensitive to the phoneme pair error rate past a certain point. (A simplified sketch of the ring-all-reduce gradient averaging appears below the table.) |
| Researcher Affiliation | Industry | ¹Baidu Silicon Valley Artificial Intelligence Lab, 1195 Bordeaux Dr., Sunnyvale, CA 94089. ²Baidu Corporation, No. 10 Xibeiwang East Road, Beijing 100193, China. |
| Pseudocode | No | The paper describes the steps for its CPU implementation in Section 5.1 using a numbered list, but this is descriptive text and not formatted as a pseudocode block or algorithm. |
| Open Source Code | No | The paper does not contain any statement about making the source code available or provide a link to a code repository. |
| Open Datasets | Yes | In addition, we present audio synthesis results for our models trained on a subset of the Blizzard 2013 data (Prahallad et al., 2013). |
| Dataset Splits | No | The paper describes data preparation and mentions training and testing phases, but it does not provide explicit training/validation/test split percentages or sample counts, nor does it cite predefined splits for validation data. |
| Hardware Specification | Yes | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. ... CPU results are from an Intel Xeon E5-2660 v3 Haswell processor clocked at 2.6 GHz and GPU results are from a GeForce GTX Titan X Maxwell GPU. |
| Software Dependencies | No | All of our models are implemented using the TensorFlow framework (Abadi et al., 2015). The paper mentions TensorFlow but does not specify a version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | For training, we use the Adam optimization algorithm with β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, a batch size of 64, a learning rate of 10⁻³, and an annealing rate of 0.85 applied every 1000 iterations (Kingma & Ba, 2014). (A configuration sketch based on these values appears below the table.) |
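
As an illustration of the data-parallel training quoted in the Research Type row (a batch split across 8 GPUs, with gradients averaged by a ring all-reduce), here is a minimal sketch that sequentially simulates the two phases of a ring all-reduce in NumPy. It assumes equal-length gradient vectors, one per GPU; the function name `ring_allreduce_average` is hypothetical and this is not the authors' TensorFlow implementation.

```python
import numpy as np

def ring_allreduce_average(grads):
    """Average equal-length gradient vectors (one per GPU) with a ring all-reduce.

    Hypothetical sketch: sequentially simulates the reduce-scatter and
    all-gather phases that real implementations run in parallel on the ring.
    """
    n = len(grads)
    # Each worker starts from its own gradient, split into n contiguous chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n - 1 steps, worker i holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so every "send" uses start-of-step values.
        outgoing = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            recv = (i - step - 1) % n  # chunk arriving from the left neighbour
            chunks[i][recv] = chunks[i][recv] + outgoing[(i - 1) % n]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        outgoing = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            recv = (i - step) % n
            chunks[i][recv] = outgoing[(i - 1) % n]

    # Every worker now holds the summed gradient; divide once to average.
    return np.concatenate(chunks[0]) / n

# Hypothetical usage: a batch of 64 split equally across 8 GPUs yields 8 gradients.
rng = np.random.default_rng(0)
per_gpu_grads = [rng.normal(size=1024) for _ in range(8)]
averaged = ring_allreduce_average(per_gpu_grads)
assert np.allclose(averaged, np.mean(per_gpu_grads, axis=0))
```

The ring topology matters for efficiency rather than correctness: each worker only ever exchanges one chunk with its neighbours per step, so bandwidth per GPU stays constant as the number of GPUs grows, which is why the result must still equal a plain mean over the per-GPU gradients (checked by the assertion above).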
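The hyperparameters quoted in the Experiment Setup row map directly onto an optimizer configuration. Below is a minimal sketch assuming the TensorFlow 2.x Keras API (the paper only states that it uses TensorFlow, without a version) and assuming staircase decay for the 0.85-per-1000-iterations annealing:

```python
import tensorflow as tf

# Learning rate 1e-3, annealed by 0.85 every 1000 iterations (staircase decay
# is an assumption; the paper does not say whether the decay is stepped or smooth).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,
    decay_rate=0.85,
    staircase=True,
)

# Adam with the reported moment and epsilon settings (Kingma & Ba, 2014).
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)

BATCH_SIZE = 64  # reported batch size
```

Each batch of 64 would then be split equally across the 8 GPUs described in the Research Type row, with per-GPU gradients averaged before a single optimizer update is applied.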