Deep Voice: Real-time Neural Text-to-Speech

Authors: Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. After approximately 14,000 iterations, the model converges to a phoneme pair error rate of 7%. We also find that phoneme boundaries do not have to be precise, and randomly shifting phoneme boundaries by 10-30 milliseconds makes no difference in the audio quality, and so suspect that audio quality is insensitive to the phoneme pair error rate past a certain point. (A sketch of the ring all-reduce gradient averaging follows the table.)
Researcher Affiliation | Industry | 1 Baidu Silicon Valley Artificial Intelligence Lab, 1195 Bordeaux Dr., Sunnyvale, CA 94089; 2 Baidu Corporation, No. 10 Xibeiwang East Road, Beijing 100193, China.
Pseudocode | No | The paper describes the steps for its CPU implementation in Section 5.1 using a numbered list, but this is descriptive text and not formatted as a pseudocode block or algorithm.
Open Source Code | No | The paper does not contain any statement about making the source code available or provide a link to a code repository.
Open Datasets | Yes | In addition, we present audio synthesis results for our models trained on a subset of the Blizzard 2013 data (Prahallad et al., 2013).
Dataset Splits | No | The paper describes data preparation and mentions training and testing phases, but it does not explicitly provide training/validation/test split percentages or sample counts, nor does it cite predefined splits for validation data.
Hardware Specification | Yes | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. ... CPU results are from an Intel Xeon E5-2660 v3 Haswell processor clocked at 2.6 GHz and GPU results are from a GeForce GTX Titan X Maxwell GPU.
Software Dependencies | No | All of our models are implemented using the TensorFlow framework (Abadi et al., 2015). The paper mentions TensorFlow but does not specify a version number or any other software dependencies with their versions.
Experiment Setup | Yes | For training, we use the Adam optimization algorithm with β1 = 0.9, β2 = 0.999, ε = 10^-8, a batch size of 64, a learning rate of 10^-3, and an annealing rate of 0.85 applied every 1000 iterations (Kingma & Ba, 2014). (A sketch of this optimizer configuration follows the table.)
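The paper does not publish its data-parallel training code, but the quoted setup (8 GPUs, equal batch splits, ring all-reduce gradient averaging) follows the standard chunked reduce-scatter plus all-gather pattern. The sketch below is a minimal NumPy simulation of that pattern, not the authors' implementation; the function name `ring_allreduce_average` and the representation of each GPU as one 1-D gradient vector in a Python list are illustrative assumptions.

```python
import numpy as np


def ring_allreduce_average(worker_grads):
    """Simulate a ring all-reduce that averages gradients across workers.

    worker_grads: list of N equal-length 1-D numpy arrays, one per simulated GPU.
    Returns a list in which every worker holds the element-wise mean.
    """
    n = len(worker_grads)
    # Each worker splits its gradient into n chunks that circulate around the ring.
    chunks = [list(np.array_split(g.astype(np.float64), n)) for g in worker_grads]

    # Reduce-scatter: after n-1 steps, worker i owns the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends for this step, then apply them (simulates simultaneous transfers).
        transfers = []
        for src in range(n):
            idx = (src - step) % n
            transfers.append(((src + 1) % n, idx, chunks[src][idx]))
        for dst, idx, data in transfers:
            chunks[dst][idx] = chunks[dst][idx] + data

    # All-gather: circulate the fully reduced chunks so every worker ends up with all of them.
    for step in range(n - 1):
        transfers = []
        for src in range(n):
            idx = (src + 1 - step) % n
            transfers.append(((src + 1) % n, idx, chunks[src][idx]))
        for dst, idx, data in transfers:
            chunks[dst][idx] = data

    # Reassemble each worker's gradient and divide by the worker count to average.
    return [np.concatenate(c) / n for c in chunks]


if __name__ == "__main__":
    # Example: 8 simulated GPUs, as in the quoted training setup.
    rng = np.random.default_rng(0)
    per_gpu = [rng.standard_normal(1000) for _ in range(8)]
    averaged = ring_allreduce_average(per_gpu)
    assert np.allclose(averaged[0], np.mean(per_gpu, axis=0))
```

The chunked two-phase schedule is what makes ring all-reduce bandwidth-optimal: each worker sends and receives only 2(n-1)/n of the gradient regardless of the number of workers, which is why it is the usual choice for multi-GPU gradient averaging.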
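The optimizer settings quoted in the Experiment Setup row map directly onto the TensorFlow 1.x API that the paper cites (Abadi et al., 2015). The snippet below is a hedged sketch of that configuration, not code from the paper: the dummy `loss` variable is a stand-in for the model's actual training loss, and the "annealing rate of 0.85 applied every 1000 iterations" is interpreted here as staircase exponential learning-rate decay.

```python
import tensorflow as tf  # TensorFlow 1.x API, as cited by the paper (Abadi et al., 2015)

# Dummy stand-in for the model's scalar training loss (the real model is not shown here).
w = tf.Variable(1.0)
loss = tf.square(w)

global_step = tf.train.get_or_create_global_step()

# Learning rate 1e-3, annealed by a factor of 0.85 every 1000 iterations.
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-3,
    global_step=global_step,
    decay_steps=1000,
    decay_rate=0.85,
    staircase=True,
)

# Adam with beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, as reported in the paper.
optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-8,
)
train_op = optimizer.minimize(loss, global_step=global_step)
# Batches of 64 examples would then be fed through `train_op` in the usual session loop.
```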