Efficient Neural Audio Synthesis
Authors: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, Koray Kavukcuoglu
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark all models on a single-speaker North American English text-to-speech dataset where the input is composed of predicted linguistic feature vectors and the output is the raw 24 kHz, 16-bit waveform (Section 5). We report the Negative Log-Likelihood (NLL) reached by a model on held-out data, the results of A/B comparison tests between a pair of models as rated by human listeners, and Mean Opinion Scores (MOS) for the samples of a model. |
| Researcher Affiliation | Industry | DeepMind; Google Brain. Correspondence to: Nal Kalchbrenner <nalk@google.com>. |
| Pseudocode | No | The paper provides mathematical equations and architectural diagrams (Figure 1), but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | Yes | Text-to-speech models were trained on a dataset of 44 hours of North American English speech recorded by a professional speaker (van den Oord et al., 2017). |
| Dataset Splits | No | The paper mentions training on sequences of 960 audio samples and evaluation on a held-out test set, but does not provide specific details on train/validation/test splits (percentages, counts, or explicit splitting methodology). |
| Hardware Specification | Yes | We implement and benchmark the sparse matrix-vector products and non-linearities used in the WaveRNN on a mobile CPU (Table 2). Even though the amounts of computation and memory bandwidth are, respectively, three and two orders of magnitude smaller on a mobile CPU than on a GPU, our benchmarks on off-the-shelf mobile CPUs indicate that the resources are sufficient for real-time on-device audio synthesis with a high-quality Sparse WaveRNN. ... A Fused variant of Subscale WaveRNN also gives a sampling speed of 10× real time on an Nvidia P100 GPU using a slight modification of the GPU kernel for WaveRNN-896. We perform our benchmarks on the Snapdragon 808 (SD 808) and Snapdragon 835 (SD 835) mobile CPUs, which are widely available in mobile phones. |
| Software Dependencies | No | The paper mentions a "regular TensorFlow implementation" but does not specify version numbers for TensorFlow or any other software components, which would be needed for reproducibility. |
| Experiment Setup | Yes | The WaveRNN models are trained on sequences of 960 16-bit audio samples, and full backpropagation through time is applied to the models. We use t0 = 1000, S = 200k and train for a total of 500k steps for all models. The conditioning network of the Subscale WaveRNN is a masked dilated 1D CNN with ten layers, convolutional kernels of size 3, 384 convolutional channels, and 768 residual channels. The conditioning CNN has 5 stages of increasing dilation, for a total future horizon of F = 128 blocks of 8 or 16 samples each. |
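The t0 and S hyperparameters quoted in the experiment setup parameterize the gradual weight-pruning schedule used for the Sparse WaveRNN: sparsity ramps cubically from zero at step t0 up to the target fraction Z over S steps, then stays constant. A minimal sketch under that assumption (the function name is hypothetical; the cubic form follows the schedule the paper describes):

```python
def pruning_sparsity(t: int, Z: float, t0: int = 1000, S: int = 200_000) -> float:
    """Fraction of weights pruned at training step t (hypothetical helper).

    Cubic ramp assumed from the paper's sparsification schedule:
    0 before step t0, then Z * (1 - (1 - (t - t0)/S)**3) during the ramp,
    constant at the target sparsity Z after step t0 + S.
    """
    if t < t0:
        return 0.0
    frac = min(1.0, (t - t0) / S)
    return Z * (1.0 - (1.0 - frac) ** 3)
```

Note the front-loaded shape: at the midpoint of the ramp (t = t0 + S/2) the sparsity has already reached 87.5% of Z, so most pruning happens early, while the 500k-step training budget leaves roughly 300k steps of fine-tuning at the final sparsity.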