FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalized well to the mel-spectrogram inversion of unseen speakers, and FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/. |
| Researcher Affiliation | Collaboration | Rongjie Huang¹, Max W. Y. Lam², Jun Wang², Dan Su², Dong Yu³, Yi Ren¹, Zhou Zhao¹ (¹Zhejiang University, ²Tencent AI Lab, China, ³Tencent AI Lab, USA) |
| Pseudocode | Yes | Algorithm 1: Training refinement network θ; Algorithm 2: Training noise predictor ϕ; Algorithm 3: Sampling |
| Open Source Code | No | The paper provides a link for audio samples (https://FastDiff.github.io/) but does not include an explicit statement or link to the source code for the described methodology. |
| Open Datasets | Yes | For a fair and reproducible comparison against other competing methods, we used the benchmark LJSpeech dataset [Ito and Johnson, 2017]. To evaluate the generalization ability of our model over unseen speakers in multi-speaker scenarios, we also used the VCTK dataset [Yamagishi et al., 2019] |
| Dataset Splits | No | The paper mentions using the LJSpeech and VCTK datasets and gives details of training steps and batch sizes, but it does not specify exact counts or percentages for the training, validation, and test splits. |
| Hardware Specification | Yes | FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. Both models were trained on 4 NVIDIA V100 GPUs using random short audio clips of 16,000 samples from each utterance with a batch size of 16 per GPU. To evaluate the sampling speed, we implemented a real-time factor assessment on a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions the use of the AdamW optimizer, but it does not provide version numbers for software dependencies such as the programming language or libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | FastDiff was trained with a constant learning rate lr = 2e-4. The refinement model θ and noise predictor ϕ were trained for 1M and 10K steps until convergence, respectively. FastDiff-TTS was trained for 500k steps using the AdamW optimizer with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹. Both models were trained on 4 NVIDIA V100 GPUs using random short audio clips of 16,000 samples from each utterance with a batch size of 16 per GPU. |
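
The Experiment Setup row above pins down the optimizer hyperparameters. A minimal PyTorch sketch of that configuration is given below; the placeholder model and the choice to combine the FastDiff learning rate with the FastDiff-TTS AdamW settings in a single optimizer are assumptions for illustration, while the numbers themselves (lr = 2e-4, β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹) come from the quoted setup.

```python
import torch

# Placeholder network standing in for the FastDiff / FastDiff-TTS model;
# only the hyperparameters below are taken from the reported setup.
model = torch.nn.Linear(80, 80)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # constant learning rate reported for FastDiff
    betas=(0.9, 0.98),  # β1, β2 reported for FastDiff-TTS training
    eps=1e-9,           # ϵ = 10⁻⁹ reported for FastDiff-TTS training
)
```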
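
The Pseudocode row lists three algorithms: training the refinement network θ, training the noise predictor ϕ, and sampling. For orientation only, the sketch below shows a generic conditional diffusion training step (standard ε-prediction under a fixed discrete noise schedule). It is not a reproduction of the paper's Algorithm 1: the network interface, tensor shapes, schedule, and loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Discrete linear noise schedule; purely illustrative, not FastDiff's schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_training_step(refine_net, waveform, mel, optimizer):
    """One generic epsilon-prediction step for a mel-conditioned waveform
    diffusion model. refine_net(noisy, mel, t) is an assumed interface,
    and waveform is assumed to have shape (batch, samples)."""
    b = waveform.size(0)
    t = torch.randint(0, T, (b,), device=waveform.device)        # random diffusion step per example
    a_bar = alpha_bar.to(waveform.device)[t].view(b, 1)          # reshape for broadcasting
    noise = torch.randn_like(waveform)
    noisy = a_bar.sqrt() * waveform + (1.0 - a_bar).sqrt() * noise
    pred_noise = refine_net(noisy, mel, t)                       # condition on mel and step index
    loss = F.mse_loss(pred_noise, noise)                         # standard DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```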