BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conducted a series of experiments on neural vocoding tasks to evaluate the proposed BDDMs." |
| Researcher Affiliation | Industry | "Max W. Y. Lam, Jun Wang, Dan Su, Tencent AI Lab, Shenzhen, China, {maxwylam, joinerwang, dansu}@tencent.com; Dong Yu, Tencent AI Lab, Bellevue, WA, USA, dyu@tencent.com" |
| Pseudocode | Yes | "Algorithm 1 Training Score Network (θ)" |
| Open Source Code | Yes | "We release our code at https://github.com/tencent-ailab/bddm." |
| Open Datasets | Yes | "We used the LJSpeech dataset (Ito & Johnson, 2017), which consists of 13,100 22 kHz audio clips of a female speaker. We also replicated the comparative experiment of neural vocoding using the multi-speaker VCTK dataset (Yamagishi et al., 2019)." |
| Dataset Splits | Yes | "All diffusion models were trained on the same training split as in (Chen et al., 2020). We split the VCTK dataset for training and testing: 100 speakers were used for training the multi-speaker model and 8 speakers for testing. We trained on a 44,257-utterance subset (40 hours) and evaluated on a held-out 100-utterance subset." |
| Hardware Specification | Yes | "The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days." |
| Software Dependencies | No | "Our proposed BDDMs and the baseline methods were all implemented with the PyTorch library." |
| Experiment Setup | Yes | "The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days. We set τ = 66 for training the BDDM vocoders in this paper. For initializing Algorithm 3 for noise scheduling, we could take as few as one training sample for validation and perform a grid search over the hyperparameters {(α̂_N = 0.1·α_T·i, β̂_N = 0.1·j)} for i, j = 1, ..., 9, i.e., 81 possibilities in total." |
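The 81-point grid mentioned in the Experiment Setup row can be enumerated directly. The sketch below is a minimal illustration, not the authors' code: variable names and the placeholder value for the final cumulative noise level α_T are assumptions, with the grid built exactly as described, (α̂_N = 0.1·α_T·i, β̂_N = 0.1·j) for i, j = 1, ..., 9.

```python
import itertools

# Placeholder: in BDDM, alpha_T would come from the trained diffusion
# model's noise schedule; the value here is purely illustrative.
alpha_T = 0.5

# Candidate initializations (alpha_hat_N, beta_hat_N) for the noise
# scheduler, searched over i, j = 1..9 as stated in the paper.
grid = [(0.1 * alpha_T * i, 0.1 * j)
        for i, j in itertools.product(range(1, 10), range(1, 10))]

# 9 x 9 = 81 possibilities in total, matching the quoted setup.
assert len(grid) == 81
```

In practice each candidate pair would be scored on the small validation set (as few as one utterance, per the quote) and the best-scoring initialization passed to the noise-scheduling procedure (Algorithm 3 in the paper).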