DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement

Authors: Wenxin Tai, Yue Lei, Fan Zhou, Goce Trajcevski, Ting Zhong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our approach yields substantial improvements in high-quality and stable speech generation, consistency with the condition factor, and inference efficiency.
Researcher Affiliation | Collaboration | Wenxin Tai¹, Yue Lei¹, Fan Zhou¹,², Goce Trajcevski³, Ting Zhong¹,² (¹University of Electronic Science and Technology of China; ²Kashi Institute of Electronics and Information Industry; ³Iowa State University)
Pseudocode | Yes | Algorithm 1: DOSE Training; Algorithm 2: DOSE Sampling
Open Source Code | Yes | Codes are publicly available at https://github.com/ICDM-UESTC/DOSE.
Open Datasets | Yes | Following previous works [4, 9, 8], we use the VoiceBank-DEMAND dataset [22, 23] for performance evaluations. To investigate the generalization ability of models, we use CHiME-4 [24] as another test dataset following [9], i.e., the models are trained on VoiceBank-DEMAND and evaluated on CHiME-4.
Dataset Splits | Yes | We select the best values for τ1 and τ2 according to the performance on a validation dataset, a small subset (10%) extracted from the training data. (A minimal split sketch follows the table.)
Hardware Specification | Yes | We train all methods for 300,000 iterations using 1 NVIDIA RTX 3090 GPU with a batch size of 16 audios.
Software Dependencies | No | The paper mentions using 'DiffWave [7] as the basic architecture' but does not specify software dependencies like programming languages or libraries with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We train all methods for 300,000 iterations using 1 NVIDIA RTX 3090 GPU with a batch size of 16 audios. DiffWave takes 50 steps with the linearly spaced training noise schedule β_t ∈ [1 × 10⁻⁴, 0.035] [4]. We select the best values for τ1 and τ2 according to the performance on a validation dataset, a small subset (10%) extracted from the training data. More experiment settings can be found in Appendix A.10. Specifically, the network is composed of 30 residual layers with residual channels 128. We use a bidirectional dilated convolution (Bi-DilConv) with kernel size 3 in each layer. We sum the skip connections from all residual layers. The total number of trainable parameters is 2.31M, slightly smaller than naive DiffWave (2.64M). (See the configuration sketch below.)
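
The Dataset Splits row reports that τ1 and τ2 are tuned on a validation set formed by holding out 10% of the training data. Below is a minimal sketch of such a hold-out split, assuming a list of training utterance identifiers; the function name, placeholder IDs, and random seed are illustrative assumptions and are not taken from the paper or its repository.

```python
# Hypothetical sketch: carve a 10% validation subset out of the training set,
# as described in the Dataset Splits row. All names and the seed are assumptions.
import random

def split_train_val(train_files, val_fraction=0.1, seed=0):
    """Shuffle the training utterances and hold out a fraction for validation."""
    files = list(train_files)
    random.Random(seed).shuffle(files)
    n_val = int(len(files) * val_fraction)
    return files[n_val:], files[:n_val]  # (training split, validation split)

# Example usage with placeholder utterance IDs:
train_split, val_split = split_train_val([f"utt_{i:03d}" for i in range(100)])
assert len(val_split) == 10
```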
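
The Experiment Setup row fixes the diffusion noise schedule and the DiffWave-style backbone dimensions. The sketch below collects those numbers into a single configuration object; the dataclass, its field names, and the NumPy usage are assumptions for illustration only, while the literal values (50 steps, β_t linearly spaced in [1e-4, 0.035], 30 residual layers, 128 residual channels, kernel size 3, batch size 16, 300,000 iterations) come from the table above.

```python
# Minimal configuration sketch, assuming PyTorch/NumPy-style conventions.
# Only the numeric values are taken from the reported experiment setup.
from dataclasses import dataclass

import numpy as np

@dataclass
class DoseConfig:
    diffusion_steps: int = 50          # DiffWave takes 50 diffusion steps
    beta_start: float = 1e-4           # linear schedule lower bound
    beta_end: float = 0.035            # linear schedule upper bound
    residual_layers: int = 30          # DiffWave-style residual blocks
    residual_channels: int = 128
    kernel_size: int = 3               # Bi-DilConv kernel size per layer
    batch_size: int = 16
    train_iterations: int = 300_000

def linear_beta_schedule(cfg: DoseConfig) -> np.ndarray:
    """Linearly spaced noise schedule beta_t over the diffusion steps."""
    return np.linspace(cfg.beta_start, cfg.beta_end, cfg.diffusion_steps)

cfg = DoseConfig()
betas = linear_beta_schedule(cfg)
alphas_cumprod = np.cumprod(1.0 - betas)  # standard DDPM cumulative product
print(betas[0], betas[-1], alphas_cumprod[-1])
```

The cumulative product of (1 - β_t) is shown only to illustrate how such a schedule is typically consumed by a DDPM-style sampler; the paper's own training and sampling procedures are given in its Algorithms 1 and 2.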