Efficient Neural Audio Synthesis

Authors: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, Koray Kavukcuoglu

ICML 2018

Reproducibility assessment (variable, result, and LLM response):
Research Type: Experimental
LLM Response: We benchmark all models on a single-speaker North-American English text-to-speech dataset where the input is composed of predicted linguistic feature vectors and the output is the raw 24 kHz, 16-bit waveform (Section 5). We report the Negative Log-Likelihood (NLL) reached by a model on held-out data, the results of A/B comparison tests between a pair of models as rated by human listeners, and Mean Opinion Scores (MOS) for the samples of a model.
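As context for the NLL metric referenced above, the sketch below computes a mean held-out negative log-likelihood in nats per sample from a model's predicted probabilities of the ground-truth values. The function name and inputs are illustrative assumptions, not from the paper, which reports NLL over 16-bit audio samples.

```python
import math

def held_out_nll(probs_for_true):
    """Mean negative log-likelihood (nats per sample), given the model's
    predicted probability assigned to each ground-truth sample value.
    Illustrative only; not the paper's evaluation code."""
    return -sum(math.log(p) for p in probs_for_true) / len(probs_for_true)
```

A perfectly confident, correct model (probability 1.0 on every true value) scores an NLL of 0; lower is better.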
Researcher Affiliation: Industry
LLM Response: ¹DeepMind ²Google Brain. Correspondence to: Nal Kalchbrenner <nalk@google.com>.
Pseudocode: No
LLM Response: The paper provides mathematical equations and architectural diagrams (Figure 1), but no structured pseudocode or algorithm blocks.
Open Source Code: No
LLM Response: The paper does not provide any concrete access to source code for the methodology described.
Open Datasets: Yes
LLM Response: Text-to-speech models were trained on a dataset of 44 hours of North American English speech recorded by a professional speaker (van den Oord et al., 2017).
Dataset Splits: No
LLM Response: The paper mentions training on sequences of 960 audio samples and evaluation on a held-out test set, but does not provide specific details on train/validation/test splits (percentages, counts, or explicit splitting methodology).
Hardware Specification: Yes
LLM Response: We implement and benchmark the sparse matrix-vector products and non-linearities used in the WaveRNN on a mobile CPU (Table 2). Even though the amounts of computation and memory bandwidth are, respectively, three and two orders of magnitude smaller on a mobile CPU than on a GPU, our benchmarks on off-the-shelf mobile CPUs indicate that the resources are sufficient for real-time on-device audio synthesis with a high-quality Sparse WaveRNN. ... A Fused variant of Subscale WaveRNN also gives a sampling speed of 10× real time on an Nvidia P100 GPU using a slight modification of the GPU kernel for WaveRNN-896. We perform our benchmarks on the Snapdragon 808 (SD 808) and Snapdragon 835 (SD 835) mobile CPUs, which are widely available in mobile phones.
Software Dependencies: No
LLM Response: The paper mentions a "regular TensorFlow implementation" but does not specify version numbers for TensorFlow or other software components, which are needed for reproducibility.
Experiment Setup: Yes
LLM Response: The WaveRNN models are trained on sequences of 960 audio samples of 16 bits each and full back-propagation-through-time is applied to the models. We use t0 = 1000, S = 200k and train for a total of 500k steps for all models. The conditioning network of the Subscale WaveRNN is a masked dilated 1D CNN and has ten layers, convolutional kernels of size 3, 384 convolutional channels, and 768 residual channels. The conditioning CNN has 5 stages of increasing dilation, for a total future horizon of F = 128 blocks of 8 or 16 samples each.
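The t0 and S values quoted above parameterize the cubic sparsification schedule described in the WaveRNN paper, z = Z * (1 - (1 - (t - t0)/S)^3) for t0 <= t <= t0 + S. A minimal sketch, where the target sparsity Z = 0.96 is an illustrative choice rather than a value quoted in this excerpt:

```python
def sparsity_at_step(t, Z=0.96, t0=1000, S=200_000):
    """Fraction of weights pruned at training step t under the cubic
    schedule z = Z * (1 - (1 - (t - t0)/S)**3), ramping from 0 at step
    t0 to the target sparsity Z at step t0 + S."""
    if t < t0:
        return 0.0
    if t > t0 + S:
        return Z
    return Z * (1.0 - (1.0 - (t - t0) / S) ** 3)
```

With t0 = 1000 and S = 200k as in the quoted setup, pruning starts after 1,000 warm-up steps and reaches the target sparsity at step 201,000, well within the 500k total training steps.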