BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conducted a series of experiments on neural vocoding tasks to evaluate the proposed BDDMs." |
| Researcher Affiliation | Industry | "Max W. Y. Lam, Jun Wang, Dan Su, Tencent AI Lab, Shenzhen, China, {maxwylam, joinerwang, dansu}@tencent.com; Dong Yu, Tencent AI Lab, Bellevue, WA, USA, dyu@tencent.com" |
| Pseudocode | Yes | "Algorithm 1 Training Score Network (θ)" |
| Open Source Code | Yes | "We release our code at https://github.com/tencent-ailab/bddm." |
| Open Datasets | Yes | "We used the LJSpeech dataset (Ito & Johnson, 2017), which consists of 13,100 22 kHz audio clips of a female speaker. We also replicated the comparative experiment of neural vocoding using the multi-speaker VCTK dataset (Yamagishi et al., 2019)." |
| Dataset Splits | Yes | "All diffusion models were trained on the same training split as in (Chen et al., 2020). We split the VCTK dataset for training and testing: 100 speakers were used for training the multi-speaker model and 8 speakers for testing. We trained on a 44,257-utterance subset (40 hours) and evaluated on a held-out 100-utterance subset." |
| Hardware Specification | Yes | "The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days." |
| Software Dependencies | No | "Our proposed BDDMs and the baseline methods were all implemented with the PyTorch library." |
| Experiment Setup | Yes | "The score networks for the LJ and VCTK speech datasets were trained from scratch on a single NVIDIA Tesla P40 GPU with batch size 32 for about 1M steps, which took about 3 days. We set τ = 66 for training the BDDM vocoders in this paper. For initializing Algorithm 3 for noise scheduling, we could take as few as one training sample for validation and perform a grid search over the hyperparameters {(α̂_N = 0.1·α_T·i, β̂_N = 0.1·j)} for i, j = 1, ..., 9, i.e., 81 possibilities in total." |
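The 81-point grid mentioned in the Experiment Setup row can be enumerated directly. The sketch below is a minimal illustration, not the authors' code: variable names and the placeholder value for the final cumulative noise level α_T are assumptions, with the grid built exactly as described, (α̂_N = 0.1·α_T·i, β̂_N = 0.1·j) for i, j = 1, ..., 9.

```python
import itertools

# Placeholder: in BDDM, alpha_T would come from the trained diffusion
# model's noise schedule; the value here is purely illustrative.
alpha_T = 0.5

# Candidate initializations (alpha_hat_N, beta_hat_N) for the noise
# scheduler, searched over i, j = 1..9 as stated in the paper.
grid = [(0.1 * alpha_T * i, 0.1 * j)
        for i, j in itertools.product(range(1, 10), range(1, 10))]

# 9 x 9 = 81 possibilities in total, matching the quoted setup.
assert len(grid) == 81
```

In practice each candidate pair would be scored on the small validation set (as few as one utterance, per the quote) and the best-scoring initialization passed to the noise-scheduling procedure (Algorithm 3 in the paper).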