DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement
Authors: Wenxin Tai, Yue Lei, Fan Zhou, Goce Trajcevski, Ting Zhong
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our approach yields substantial improvements in high-quality and stable speech generation, consistency with the condition factor, and inference efficiency. |
| Researcher Affiliation | Collaboration | Wenxin Tai (1), Yue Lei (1), Fan Zhou (1,2), Goce Trajcevski (3), Ting Zhong (1,2); (1) University of Electronic Science and Technology of China, (2) Kashi Institute of Electronics and Information Industry, (3) Iowa State University |
| Pseudocode | Yes | Algorithm 1 DOSE Training; Algorithm 2 DOSE Sampling |
| Open Source Code | Yes | Codes are publicly available at https://github.com/ICDM-UESTC/DOSE. |
| Open Datasets | Yes | Following previous works [4, 9, 8], we use the VoiceBank-DEMAND dataset [22, 23] for performance evaluations. To investigate the generalization ability of models, we use CHiME-4 [24] as another test dataset following [9], i.e., the models are trained on VoiceBank-DEMAND and evaluated on CHiME-4. |
| Dataset Splits | Yes | We select the best values for τ1 and τ2 according to the performance on a validation dataset, a small subset (10%) extracted from the training data. |
| Hardware Specification | Yes | We train all methods for 300,000 iterations using 1 NVIDIA RTX 3090 GPU with a batch size of 16 audios. |
| Software Dependencies | No | The paper mentions using 'DiffWave [7] as the basic architecture' but does not specify software dependencies such as programming languages or libraries with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train all methods for 300,000 iterations using 1 NVIDIA RTX 3090 GPU with a batch size of 16 audios. DiffWave takes 50 steps with the linearly spaced training noise schedule β_t ∈ [1 × 10⁻⁴, 0.035] [4]. We select the best values for τ1 and τ2 according to the performance on a validation dataset, a small subset (10%) extracted from the training data. More experiment settings can be found in Appendix A.10. Specifically, the network is composed of 30 residual layers with residual channels 128. We use a bidirectional dilated convolution (Bi-DilConv) with kernel size 3 in each layer. We sum the skip connections from all residual layers. The total number of trainable parameters is 2.31M, slightly smaller than naive DiffWave (2.64M). (See the sketches after this table.) |
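
For readers checking the training configuration quoted in the Experiment Setup and Dataset Splits rows, here is a minimal Python sketch. Only the 50-step linearly spaced schedule over [1 × 10⁻⁴, 0.035] and the 10% validation fraction come from the quoted text; the function names, the random split strategy, and the placeholder dataset size are assumptions for illustration, following standard DDPM conventions.

```python
import numpy as np

# Hypothetical helper names; only the numeric settings below are quoted from the paper.

def linear_noise_schedule(num_steps: int = 50,
                          beta_start: float = 1e-4,
                          beta_end: float = 0.035) -> np.ndarray:
    """Linearly spaced beta_t schedule, as stated in the Experiment Setup row."""
    return np.linspace(beta_start, beta_end, num_steps)

def alpha_bar(betas: np.ndarray) -> np.ndarray:
    """Cumulative product of (1 - beta_t), the usual DDPM noise-level term."""
    return np.cumprod(1.0 - betas)

def train_val_split(num_items: int, val_fraction: float = 0.1, seed: int = 0):
    """10% validation subset drawn from the training data; a random split is an assumption."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_items)
    n_val = int(num_items * val_fraction)
    return indices[n_val:], indices[:n_val]  # train indices, validation indices

if __name__ == "__main__":
    betas = linear_noise_schedule()
    print(betas[0], betas[-1])                   # 0.0001 ... 0.035
    print(alpha_bar(betas)[-1])                  # final cumulative noise level
    train_idx, val_idx = train_val_split(10000)  # placeholder count, not from the paper
    print(len(train_idx), len(val_idx))
```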
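
The architecture numbers in the Experiment Setup row (30 residual layers, 128 residual channels, kernel size 3, summed skip connections) can be pictured with the rough PyTorch sketch below. This is not the DOSE implementation: the dilation cycle, the gated activation, and the omission of the conditioner and diffusion-step embedding are assumptions borrowed from the usual DiffWave layout, so the sketch's parameter count will not match the paper's 2.31M.

```python
import torch
import torch.nn as nn

# Rough DiffWave-style residual stack matching the quoted sizes
# (30 layers, 128 residual channels, kernel size 3, summed skips).
# Conditioner and diffusion-step embedding are intentionally omitted.

class ResidualLayer(nn.Module):
    def __init__(self, channels: int = 128, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # "Bidirectional" dilated conv taken to mean non-causal (symmetric) padding.
        self.dil_conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                                  dilation=dilation,
                                  padding=dilation * (kernel_size - 1) // 2)
        self.res_conv = nn.Conv1d(channels, channels, 1)
        self.skip_conv = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor):
        h = self.dil_conv(x)
        gate, filt = h.chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)          # gated activation (assumption)
        return (x + self.res_conv(h)) / 2 ** 0.5, self.skip_conv(h)

class ResidualStack(nn.Module):
    def __init__(self, num_layers: int = 30, channels: int = 128, cycle: int = 10):
        super().__init__()
        # Dilations doubling within a 10-layer cycle is an assumption carried
        # over from DiffWave's usual configuration, not stated in the paper.
        self.layers = nn.ModuleList(
            ResidualLayer(channels, dilation=2 ** (i % cycle))
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor):
        skips = 0.0
        for layer in self.layers:
            x, skip = layer(x)
            skips = skips + skip          # sum skip connections from all layers
        return skips

if __name__ == "__main__":
    net = ResidualStack()
    out = net(torch.randn(1, 128, 8000))  # (batch, residual channels, samples)
    print(out.shape)
```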