A Spectral Energy Distance for Parallel Speech Synthesis
Authors: Alexey Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators. We evaluate our proposed approach to speech synthesis by training 4 different models on the data set described in the previous section. |
| Researcher Affiliation | Industry | Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner ({agritsenko,salimans,riannevdberg,jsnoek,nalk}@google.com), Google Research |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | An open source implementation of our generalized energy distance is available at https://github.com/google-research/google-research/tree/master/ged_tts. (A minimal sketch of this loss follows the table.) |
| Open Datasets | No | Our TTS models are trained on a single-speaker North American English dataset, consisting of speech data spoken by a professional voice actor. The data consists of approximately sixty thousand utterances with durations ranging approximately from 0.5 seconds to 1 minute, and their corresponding aligned linguistic features and pitch information. The paper does not provide a specific name, URL, DOI, or formal citation (authors and year) under which this dataset is publicly accessible. |
| Dataset Splits | Yes | We report these metrics on both the training data and a larger validation set. We verified that this improvement is not due to overfitting by re-training the models on a smaller dataset (38.6 hours) and re-computing these metrics on a larger validation set (5.8 hours). |
| Hardware Specification | Yes | All of our models are trained on Cloud TPUs v3 with hyperparameters as described in Table 5 of the Appendix. |
| Software Dependencies | No | The paper mentions 'Our model parameters are updated using Adam [17]' but does not provide specific version numbers for software libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages used for implementation. |
| Experiment Setup | Yes | All of our models are trained on Cloud TPUs v3 with hyperparameters as described in Table 5 of the Appendix. Our model parameters are updated using Adam [17]. |
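
For orientation, below is a minimal NumPy sketch of the generalized energy distance training loss referenced in the Open Source Code row. Only the overall form L = d(x, y1) + d(x, y2) - d(y1, y2), with two independent model samples y1 and y2 for the same conditioning, follows the paper; the helper names (`stft_magnitude`, `spectral_distance`), the window sizes, and the term weighting are illustrative assumptions, not the released implementation at the repository above.

```python
import numpy as np

def stft_magnitude(x, frame_length, hop):
    """Magnitude spectrogram of a 1-D waveform via a Hann-windowed framed FFT."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(x) - frame_length) // hop
    frames = np.stack([x[i * hop:i * hop + frame_length] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_distance(a, b, frame_lengths=(256, 512, 1024)):
    """Multi-scale spectrogram distance between two waveforms: L1 on
    log-magnitudes plus an L2 magnitude term, summed over window sizes.
    The exact combination and weights in the paper may differ; this is
    an illustrative choice."""
    total = 0.0
    for n in frame_lengths:
        sa = stft_magnitude(a, n, hop=n // 4)
        sb = stft_magnitude(b, n, hop=n // 4)
        total += np.mean(np.abs(np.log(sa + 1e-5) - np.log(sb + 1e-5)))
        total += np.sqrt(np.sum((sa - sb) ** 2))
    return total

def ged_loss(x, y1, y2):
    """Monte-Carlo generalized energy distance loss
        L = d(x, y1) + d(x, y2) - d(y1, y2),
    where y1, y2 are two independent model samples drawn for the same
    conditioning signal. The E[d(x, x')] term of the full energy distance
    is constant in the model parameters and is therefore dropped."""
    return (spectral_distance(x, y1) + spectral_distance(x, y2)
            - spectral_distance(y1, y2))

# Toy usage: white-noise stand-ins for the real waveform and two samples.
rng = np.random.default_rng(0)
x, y1, y2 = (rng.standard_normal(16000) for _ in range(3))
print(ged_loss(x, y1, y2))
```

The repulsive d(y1, y2) term is what distinguishes this loss from a plain spectrogram regression: it rewards diversity between independent samples and is what makes the objective a proper scoring rule rather than a mode-averaging reconstruction loss.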