Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. ... LJSpeech dataset (Ito, 2017) containing approximately 24 hours of English female voice recordings sampled at 22.05 kHz was used to train the Grad-TTS model. The test set contained around 500 short audio recordings (duration less than 10 seconds each). ... We take an official implementation of Glow-TTS (Kim et al., 2020), the model which resembles ours to the most extent among the existing feature generators, FastSpeech (Ren et al., 2019), and state-of-the-art Tacotron2 (Shen et al., 2018). |
| Researcher Affiliation | Collaboration | ¹Huawei Noah's Ark Lab, Moscow, Russia; ²Higher School of Economics, Moscow, Russia. |
| Pseudocode | No | The paper describes procedures and equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code will soon be available at https://github.com/huawei-noah/speech-backbones. |
| Open Datasets | Yes | LJSpeech dataset (Ito, 2017) containing approximately 24 hours of English female voice recordings sampled at 22.05 kHz was used to train the Grad-TTS model. |
| Dataset Splits | No | The paper mentions a test set but does not specify the training/validation/test dataset splits needed for reproduction. |
| Hardware Specification | Yes | Grad-TTS was trained for 1.7m iterations on a single GPU (NVIDIA RTX 2080 Ti with 11GB memory) with mini-batch size 16. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python library versions or framework versions. |
| Experiment Setup | Yes | Grad-TTS was trained for 1.7m iterations on a single GPU (NVIDIA RTX 2080 Ti with 11GB memory) with mini-batch size 16. We chose Adam optimizer and set the learning rate to 0.0001. ... We chose T = 1, βt = β0 + (β1 − β0)t, β0 = 0.05 and β1 = 20. ... We use τ = 1.5 at synthesis for all four models. (See the schedule sketch below the table.) |
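
The Experiment Setup row gives enough detail to write down the reported noise schedule and optimizer settings directly. Below is a minimal sketch of the linear schedule βt = β0 + (β1 − β0)t with the reported values, assuming PyTorch; the function and variable names are ours for illustration, not taken from the authors' implementation.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row; variable names are
# illustrative, not taken from the authors' code.
beta_0, beta_1 = 0.05, 20.0   # endpoints of the linear noise schedule
T = 1.0                       # terminal diffusion time
learning_rate = 1e-4          # Adam learning rate
batch_size = 16               # mini-batch size
tau = 1.5                     # temperature used at synthesis

def beta_t(t: torch.Tensor) -> torch.Tensor:
    """Linear noise schedule: beta_t = beta_0 + (beta_1 - beta_0) * t, for t in [0, T]."""
    return beta_0 + (beta_1 - beta_0) * t

# Example: evaluate the schedule at a few diffusion times.
ts = torch.tensor([0.0, 0.5, 1.0])
print(beta_t(ts))  # tensor([ 0.0500, 10.0250, 20.0000])

# The optimizer would be configured with the reported learning rate, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```

This covers only the scalar hyperparameters the paper states explicitly; the score-network architecture and training loop are not specified at this level of detail and are not reconstructed here.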