BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a series of experiments on neural vocoding tasks to evaluate the proposed BDDMs. |
| Researcher Affiliation | Industry | Max W. Y. Lam, Jun Wang, Dan Su Tencent AI Lab Shenzhen, China {maxwylam, joinerwang, dansu}@tencent.com Dong Yu Tencent AI Lab Bellevue WA, USA dyu@tencent.com |
| Pseudocode | Yes | Algorithm 1 Training Score Network (θ) |
| Open Source Code | Yes | We release our code at https://github.com/tencent-ailab/bddm. |
| Open Datasets | Yes | we used the LJSpeech dataset (Ito & Johnson, 2017), which consists of 13,100 22 kHz audio clips of a female speaker. We also replicated the comparative experiment of neural vocoding using a multi-speaker VCTK dataset (Yamagishi et al., 2019) |
| Dataset Splits | Yes | All diffusion models were trained on the same training split as in (Chen et al., 2020). We split the VCTK dataset for training and testing: 100 speakers were used for training the multi-speaker model and 8 speakers for testing. We trained on a 44257-utterance subset (40 hours) and evaluated on a held-out 100-utterance subset. |
| Hardware Specification | Yes | The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days. |
| Software Dependencies | No | Our proposed BDDMs and the baseline methods were all implemented with the PyTorch library. (The library is named, but no version numbers or dependency list are given.) |
| Experiment Setup | Yes | The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days. We set τ = 66 for training the BDDM vocoders in this paper. For initializing Algorithm 3 for noise scheduling, we could take as few as 1 training sample for validation, performing a grid search over the hyperparameters {(α̂N = 0.1 αT i, β̂N = 0.1 j)} for i, j = 1, ..., 9, i.e., 81 possibilities in total (see the grid-search sketch below the table) |
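The grid-search initialization quoted in the Experiment Setup row is simple to make concrete. Below is a minimal, hedged sketch of that search: the function names `init_noise_schedule` and `validation_loss` are hypothetical placeholders (not the authors' API from the released repository), but the candidate grid matches the quoted setup, with α̂N = 0.1 αT i and β̂N = 0.1 j for i, j = 1, ..., 9, giving 81 candidates scored on as few as one held-out sample.

```python
import itertools

import numpy as np


def init_noise_schedule(alpha_T, validation_loss):
    """Grid-search initialization for Algorithm 3's noise scheduling.

    Hypothetical sketch: `alpha_T` is the final alpha of the training
    noise schedule, and `validation_loss(alpha_hat_N, beta_hat_N)` is a
    user-supplied callable scoring a candidate pair on held-out audio
    (the paper notes a single training sample can suffice).
    """
    # 9 x 9 = 81 candidate pairs, as quoted in the Experiment Setup row.
    candidates = [
        (0.1 * alpha_T * i, 0.1 * j)
        for i, j in itertools.product(range(1, 10), repeat=2)
    ]
    # Keep the pair with the lowest validation loss.
    losses = [validation_loss(a_hat, b_hat) for a_hat, b_hat in candidates]
    return candidates[int(np.argmin(losses))]
```

Exhaustive search is cheap here because the grid is tiny (81 evaluations) and each evaluation needs only a short validation pass, which is consistent with the paper's claim that scheduling can be initialized from very little data.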