FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/. |
| Researcher Affiliation | Collaboration | Rongjie Huang¹, Max W. Y. Lam², Jun Wang², Dan Su², Dong Yu³, Yi Ren¹, Zhou Zhao¹ (¹Zhejiang University, ²Tencent AI Lab, China, ³Tencent AI Lab, USA) |
| Pseudocode | Yes | Algorithm 1: Training refinement network θ; Algorithm 2: Training noise predictor ϕ; Algorithm 3: Sampling |
| Open Source Code | No | The paper provides a link for audio samples (https://FastDiff.github.io/) but does not include an explicit statement or link to the source code for the described methodology. |
| Open Datasets | Yes | For a fair and reproducible comparison against other competing methods, we used the benchmark LJSpeech dataset [Ito and Johnson, 2017]. To evaluate the generalization ability of our model over unseen speakers in multi-speaker scenarios, we also used the VCTK dataset [Yamagishi et al., 2019] |
| Dataset Splits | No | The paper mentions using the LJSpeech and VCTK datasets and gives details of training steps and batch sizes, but it does not specify exact counts or percentages for the training, validation, and test splits. |
| Hardware Specification | Yes | FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. Both models were trained on 4 NVIDIA V100 GPUs using random short audio clips of 16,000 samples from each utterance with a batch size of 16 per GPU. To evaluate the sampling speed, we implemented a real-time factor assessment on a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions the use of the AdamW optimizer, but it does not provide version numbers for software dependencies such as the programming language or libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | FastDiff was trained with a constant learning rate lr = 2e-4. The refinement model θ and noise predictor ϕ were trained for 1M and 10K steps until convergence, respectively. FastDiff-TTS was trained for 500k steps using the AdamW optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. Both models were trained on 4 NVIDIA V100 GPUs using random short audio clips of 16,000 samples from each utterance with a batch size of 16 per GPU. |
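
The Experiment Setup row above pins down the optimizer hyperparameters. A minimal PyTorch sketch of that configuration is given below; the placeholder model and the choice to combine the FastDiff learning rate with the FastDiff-TTS AdamW settings in a single optimizer are assumptions for illustration, while the numbers themselves (lr = 2e-4, β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹) come from the quoted setup.

```python
import torch

# Placeholder network standing in for the FastDiff / FastDiff-TTS model;
# only the hyperparameters below are taken from the reported setup.
model = torch.nn.Linear(80, 80)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # constant learning rate reported for FastDiff
    betas=(0.9, 0.98),  # β1, β2 reported for FastDiff-TTS training
    eps=1e-9,           # ϵ = 10⁻⁹ reported for FastDiff-TTS training
)
```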
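
The Pseudocode row lists three algorithms: training the refinement network θ, training the noise predictor ϕ, and sampling. For orientation only, the sketch below shows a generic conditional diffusion training step (standard ε-prediction under a fixed discrete noise schedule). It is not a reproduction of the paper's Algorithm 1: the network interface, tensor shapes, schedule, and loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Discrete linear noise schedule; purely illustrative, not FastDiff's schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_training_step(refine_net, waveform, mel, optimizer):
    """One generic epsilon-prediction step for a mel-conditioned waveform
    diffusion model. refine_net(noisy, mel, t) is an assumed interface,
    and waveform is assumed to have shape (batch, samples)."""
    b = waveform.size(0)
    t = torch.randint(0, T, (b,), device=waveform.device)        # random diffusion step per example
    a_bar = alpha_bar.to(waveform.device)[t].view(b, 1)          # reshape for broadcasting
    noise = torch.randn_like(waveform)
    noisy = a_bar.sqrt() * waveform + (1.0 - a_bar).sqrt() * noise
    pred_noise = refine_net(noisy, mel, t)                       # condition on mel and step index
    loss = F.mse_loss(pred_noise, noise)                         # standard DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```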