Efficient Neural Audio Synthesis
Authors: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark all models on a single-speaker North American English text-to-speech dataset where the input is composed of predicted linguistic feature vectors and the output is the raw 24 kHz, 16-bit waveform (Section 5). We report the Negative Log-Likelihood (NLL) reached by a model on held-out data, the results of A/B comparison tests between a pair of models as rated by human listeners, and Mean Opinion Scores (MOS) for the samples of a model. |
| Researcher Affiliation | Industry | DeepMind; Google Brain. Correspondence to: Nal Kalchbrenner <nalk@google.com>. |
| Pseudocode | No | The paper provides mathematical equations and architectural diagrams (Figure 1), but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | Text-to-speech models were trained on a dataset of 44 hours of North American English speech recorded by a professional speaker (van den Oord et al., 2017). |
| Dataset Splits | No | The paper mentions training on sequences of 960 audio samples and evaluation on a held-out test set, but does not provide specific details on train/validation/test splits (percentages, counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We implement and benchmark the sparse matrix-vector products and non-linearities used in the WaveRNN on a mobile CPU (Table 2). Even though the amounts of computation and memory bandwidth are, respectively, three and two orders of magnitude smaller on a mobile CPU than on a GPU, our benchmarks on off-the-shelf mobile CPUs indicate that the resources are sufficient for real-time on-device audio synthesis with a high-quality Sparse WaveRNN. ... A Fused variant of Subscale WaveRNN also gives a sampling speed of 10× real time on an Nvidia P100 GPU using a slight modification of the GPU kernel for WaveRNN-896. We perform our benchmarks on the Snapdragon 808 (SD 808) and Snapdragon 835 (SD 835) mobile CPUs, which are widely available in mobile phones. |
| Software Dependencies | No | The paper mentions a "regular TensorFlow implementation" but does not specify version numbers for TensorFlow or any other software components, which would be needed for reproducibility. |
| Experiment Setup | Yes | The WaveRNN models are trained on sequences of 960 16-bit audio samples, and full backpropagation through time is applied to the models. We use t0 = 1000, S = 200k and train for a total of 500k steps for all models. The conditioning network of the Subscale WaveRNN is a masked dilated 1D CNN with ten layers, convolutional kernels of size 3, 384 convolutional channels, and 768 residual channels. The conditioning CNN has 5 stages of increasing dilation, for a total future horizon of F = 128 blocks of 8 or 16 samples each. |
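The t0 and S hyperparameters quoted in the experiment setup parameterize the gradual weight-pruning schedule used for the Sparse WaveRNN: sparsity ramps cubically from zero at step t0 up to the target fraction Z over S steps, then stays constant. A minimal sketch under that assumption (the function name is hypothetical; the cubic form follows the schedule the paper describes):

```python
def pruning_sparsity(t: int, Z: float, t0: int = 1000, S: int = 200_000) -> float:
    """Fraction of weights pruned at training step t (hypothetical helper).

    Cubic ramp assumed from the paper's sparsification schedule:
    0 before step t0, then Z * (1 - (1 - (t - t0)/S)**3) during the ramp,
    constant at the target sparsity Z after step t0 + S.
    """
    if t < t0:
        return 0.0
    frac = min(1.0, (t - t0) / S)
    return Z * (1.0 - (1.0 - frac) ** 3)
```

Note the front-loaded shape: at the midpoint of the ramp (t = t0 + S/2) the sparsity has already reached 87.5% of Z, so most pruning happens early, while the 500k-step training budget leaves roughly 300k steps of fine-tuning at the final sparsity.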