Deep Voice: Real-time Neural Text-to-Speech

Authors: Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. After approximately 14,000 iterations, the model converges to a phoneme pair error rate of 7%. We also find that phoneme boundaries do not have to be precise, and randomly shifting phoneme boundaries by 10-30 milliseconds makes no difference in the audio quality, and so suspect that audio quality is insensitive to the phoneme pair error rate past a certain point. (A sketch of the ring all-reduce gradient averaging follows the table.)
Researcher Affiliation | Industry | 1 Baidu Silicon Valley Artificial Intelligence Lab, 1195 Bordeaux Dr., Sunnyvale, CA 94089; 2 Baidu Corporation, No. 10 Xibeiwang East Road, Beijing 100193, China.
Pseudocode | No | The paper describes the steps for its CPU implementation in Section 5.1 using a numbered list, but this is descriptive text and not formatted as a pseudocode block or algorithm.
Open Source Code | No | The paper does not contain any statement about making the source code available or provide a link to a code repository.
Open Datasets | Yes | In addition, we present audio synthesis results for our models trained on a subset of the Blizzard 2013 data (Prahallad et al., 2013).
Dataset Splits | No | The paper describes data preparation and mentions training and testing phases, but it does not explicitly provide training/validation/test split percentages or sample counts, nor does it cite predefined splits for validation data.
Hardware Specification | Yes | We train on 8 Titan X Maxwell GPUs, splitting each batch equally among the GPUs and using a ring all-reduce to average gradients computed on different GPUs, with each iteration taking approximately 1300 milliseconds. ... CPU results are from an Intel Xeon E5-2660 v3 Haswell processor clocked at 2.6 GHz and GPU results are from a GeForce GTX Titan X Maxwell GPU.
Software Dependencies | No | All of our models are implemented using the TensorFlow framework (Abadi et al., 2015). The paper mentions TensorFlow but does not specify a version number or any other software dependencies with their versions.
Experiment Setup | Yes | For training, we use the Adam optimization algorithm with β1 = 0.9, β2 = 0.999, ε = 10^-8, a batch size of 64, a learning rate of 10^-3, and an annealing rate of 0.85 applied every 1000 iterations (Kingma & Ba, 2014). (A sketch of this optimizer configuration follows the table.)
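The paper does not publish its data-parallel training code, but the quoted setup (8 GPUs, equal batch splits, ring all-reduce gradient averaging) follows the standard chunked reduce-scatter plus all-gather pattern. The sketch below is a minimal NumPy simulation of that pattern, not the authors' implementation; the function name `ring_allreduce_average` and the representation of each GPU as one 1-D gradient vector in a Python list are illustrative assumptions.

```python
import numpy as np


def ring_allreduce_average(worker_grads):
    """Simulate a ring all-reduce that averages gradients across workers.

    worker_grads: list of N equal-length 1-D numpy arrays, one per simulated GPU.
    Returns a list in which every worker holds the element-wise mean.
    """
    n = len(worker_grads)
    # Each worker splits its gradient into n chunks that circulate around the ring.
    chunks = [list(np.array_split(g.astype(np.float64), n)) for g in worker_grads]

    # Reduce-scatter: after n-1 steps, worker i owns the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends for this step, then apply them (simulates simultaneous transfers).
        transfers = []
        for src in range(n):
            idx = (src - step) % n
            transfers.append(((src + 1) % n, idx, chunks[src][idx]))
        for dst, idx, data in transfers:
            chunks[dst][idx] = chunks[dst][idx] + data

    # All-gather: circulate the fully reduced chunks so every worker ends up with all of them.
    for step in range(n - 1):
        transfers = []
        for src in range(n):
            idx = (src + 1 - step) % n
            transfers.append(((src + 1) % n, idx, chunks[src][idx]))
        for dst, idx, data in transfers:
            chunks[dst][idx] = data

    # Reassemble each worker's gradient and divide by the worker count to average.
    return [np.concatenate(c) / n for c in chunks]


if __name__ == "__main__":
    # Example: 8 simulated GPUs, as in the quoted training setup.
    rng = np.random.default_rng(0)
    per_gpu = [rng.standard_normal(1000) for _ in range(8)]
    averaged = ring_allreduce_average(per_gpu)
    assert np.allclose(averaged[0], np.mean(per_gpu, axis=0))
```

The chunked two-phase schedule is what makes ring all-reduce bandwidth-optimal: each worker sends and receives only 2(n-1)/n of the gradient regardless of the number of workers, which is why it is the usual choice for multi-GPU gradient averaging.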
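The optimizer settings quoted in the Experiment Setup row map directly onto the TensorFlow 1.x API that the paper cites (Abadi et al., 2015). The snippet below is a hedged sketch of that configuration, not code from the paper: the dummy `loss` variable is a stand-in for the model's actual training loss, and the "annealing rate of 0.85 applied every 1000 iterations" is interpreted here as staircase exponential learning-rate decay.

```python
import tensorflow as tf  # TensorFlow 1.x API, as cited by the paper (Abadi et al., 2015)

# Dummy stand-in for the model's scalar training loss (the real model is not shown here).
w = tf.Variable(1.0)
loss = tf.square(w)

global_step = tf.train.get_or_create_global_step()

# Learning rate 1e-3, annealed by a factor of 0.85 every 1000 iterations.
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-3,
    global_step=global_step,
    decay_steps=1000,
    decay_rate=0.85,
    staircase=True,
)

# Adam with beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, as reported in the paper.
optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-8,
)
train_op = optimizer.minimize(loss, global_step=global_step)
# Batches of 64 examples would then be fed through `train_op` in the usual session loop.
```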