Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. ... LJSpeech dataset (Ito, 2017) containing approximately 24 hours of English female voice recordings sampled at 22.05 kHz was used to train the Grad-TTS model. The test set contained around 500 short audio recordings (duration less than 10 seconds each). ... We take an official implementation of Glow-TTS (Kim et al., 2020), the model which resembles ours to the most extent among the existing feature generators, FastSpeech (Ren et al., 2019), and state-of-the-art Tacotron2 (Shen et al., 2018). |
| Researcher Affiliation | Collaboration | ¹Huawei Noah's Ark Lab, Moscow, Russia; ²Higher School of Economics, Moscow, Russia. |
| Pseudocode | No | The paper describes procedures and equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code will soon be available at https://github.com/huawei-noah/speech-backbones. |
| Open Datasets | Yes | LJSpeech dataset (Ito, 2017) containing approximately 24 hours of English female voice recordings sampled at 22.05 kHz was used to train the Grad-TTS model. |
| Dataset Splits | No | The paper mentions a test set but does not specify the training/validation/test dataset splits needed for reproduction. |
| Hardware Specification | Yes | Grad-TTS was trained for 1.7m iterations on a single GPU (NVIDIA RTX 2080 Ti with 11GB memory) with mini-batch size 16. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python library versions or framework versions. |
| Experiment Setup | Yes | Grad-TTS was trained for 1.7m iterations on a single GPU (NVIDIA RTX 2080 Ti with 11GB memory) with mini-batch size 16. We chose Adam optimizer and set the learning rate to 0.0001. ... We chose T = 1, βt = β0 + (β1 − β0)t, β0 = 0.05 and β1 = 20. ... We use τ = 1.5 at synthesis for all four models. (See the schedule sketch below the table.) |
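
The Experiment Setup row gives enough detail to write down the reported noise schedule and optimizer settings directly. Below is a minimal sketch of the linear schedule βt = β0 + (β1 − β0)t with the reported values, assuming PyTorch; the function and variable names are ours for illustration, not taken from the authors' implementation.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row; variable names are
# illustrative, not taken from the authors' code.
beta_0, beta_1 = 0.05, 20.0   # endpoints of the linear noise schedule
T = 1.0                       # terminal diffusion time
learning_rate = 1e-4          # Adam learning rate
batch_size = 16               # mini-batch size
tau = 1.5                     # temperature used at synthesis

def beta_t(t: torch.Tensor) -> torch.Tensor:
    """Linear noise schedule: beta_t = beta_0 + (beta_1 - beta_0) * t, for t in [0, T]."""
    return beta_0 + (beta_1 - beta_0) * t

# Example: evaluate the schedule at a few diffusion times.
ts = torch.tensor([0.0, 0.5, 1.0])
print(beta_t(ts))  # tensor([ 0.0500, 10.0250, 20.0000])

# The optimizer would be configured with the reported learning rate, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```

This covers only the scalar hyperparameters the paper states explicitly; the score-network architecture and training loop are not specified at this level of detail and are not reconstructed here.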