Phase-Aware Speech Enhancement with Deep Complex U-Net

Authors: Hyeong-Seok Choi, Jang-Hyun Kim, Jaesung Huh, Adrian Kim, Jung-Woo Ha, Kyogu Lee

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.
Researcher Affiliation | Collaboration | Hyeong-Seok Choi (1,3), Jang-Hyun Kim (2,3), Jaesung Huh (3), Adrian Kim (3), Jung-Woo Ha (3), Kyogu Lee (1). (1) Department of Transdisciplinary Studies, Seoul National University, Seoul, Korea; (2) Department of Mathematical Sciences, Seoul National University, Seoul, Korea; (3) Clova AI Research, NAVER Corp., Seongnam, Korea. Contact: kekepa15@snu.ac.kr, blue378@snu.ac.kr, jaesung.huh@navercorp.com, adrian.kim@navercorp.com, jungwoo.ha@navercorp.com, kglee@snu.ac.kr
Pseudocode | No | The paper describes methods and architectures in text and diagrams (Figures 7-10), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Audio samples are available at the following link: http://kkp15.github.io/DeepComplexUNet. The paper does not provide an explicit statement or link for the open-source code of the methodology itself.
Open Datasets | Yes | Noise and clean speech recordings were provided from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) (Thiemann et al., 2013) and the Voice Bank corpus (Veaux et al., 2013), respectively, each recorded with a sampling rate of 48 kHz.
Dataset Splits | Yes | Mixed audio inputs used for training were composed by mixing the two datasets with four signal-to-noise ratio (SNR) settings (15, 10, 5, and 0 dB), using 10 types of noise (2 synthetic + 8 from DEMAND) and 28 speakers from the Voice Bank corpus, creating 40 conditional patterns for each speech sample. The test set inputs were made with four SNR settings different from the training set (17.5, 12.5, 7.5, and 2.5 dB), using the remaining 5 noise types from DEMAND and 2 speakers from the Voice Bank corpus. Note that the speaker and noise classes were uniquely selected for the training and test sets. (A hedged sketch of SNR-controlled mixing appears after this table.)
Hardware Specification | No | All experiments were implemented and fine-tuned with the NAVER Smart Machine Learning (NSML) platform (Sung et al., 2017; Kim et al., 2018). Only the training platform is named; no GPU, CPU, or memory specifications are given.
Software Dependencies | No | The paper mentions using the NAVER Smart Machine Learning (NSML) platform but does not specify any other software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific libraries).
Experiment Setup | Yes | The original raw waveforms were first downsampled from 48 kHz to 16 kHz. For the actual model input, complex-valued spectrograms were obtained from the downsampled waveforms via STFT with a 64 ms Hann window and a 16 ms hop length. The default configuration uses a 20-layer Deep Complex U-Net (DCUnet-20) to estimate a tanh-bounded complex ratio mask (cRM), optimized with the weighted-SDR loss. Both architectures are specified in detail in Appendix B. (Sketches of the STFT front end, the bounded cRM, and the weighted-SDR loss appear after this table.)
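
The SNR-controlled mixing described in the Dataset Splits row is not released as code; below is a minimal NumPy sketch under stated assumptions. The function name `mix_at_snr`, the noise tiling, and the power-based scaling are illustrative choices, not the authors' pipeline:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with noise at a target SNR in dB.

    The noise is tiled/truncated to the utterance length, then scaled so
    that 10 * log10(P_clean / P_noise) equals snr_db.
    """
    if len(noise) < len(clean):
        # tile the noise clip so it covers the whole utterance
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Under the splits above, each training utterance would be mixed at snr_db in {15, 10, 5, 0} and each test utterance at {17.5, 12.5, 7.5, 2.5}, with disjoint speaker and noise classes between the two sets.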
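The Experiment Setup row pins down the STFT front end and the masking target; a minimal PyTorch sketch follows. The helper names (`to_complex_spec`, `apply_bounded_crm`) and the epsilon guards are assumptions, while the window/hop sizes and the tanh-bounded cRM formula follow the paper:

```python
import torch

def to_complex_spec(wave_16k: torch.Tensor) -> torch.Tensor:
    """Complex spectrogram via STFT: 64 ms Hann window (1024 samples at
    16 kHz) and 16 ms hop (256 samples), as stated in the setup."""
    n_fft, hop = 1024, 256
    window = torch.hann_window(n_fft, device=wave_16k.device)
    return torch.stft(wave_16k, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)

def apply_bounded_crm(spec_x: torch.Tensor, net_out: torch.Tensor) -> torch.Tensor:
    """Apply the tanh-bounded complex ratio mask: the raw complex network
    output O is compressed to M = tanh(|O|) * O / |O|, and the enhanced
    spectrogram is the complex product M * X."""
    mag = torch.abs(net_out) + 1e-8        # avoid division by zero
    mask = torch.tanh(mag) * (net_out / mag)
    return mask * spec_x
```

The weighted-SDR loss is defined in the paper as a convex combination of SDR-style terms on the estimated speech y_hat and the implied noise estimate x - y_hat, weighted by the energy ratio alpha = ||y||^2 / (||y||^2 + ||x - y||^2). A sketch, where the eps stabilizer and the batch-mean reduction are assumptions:

```python
def weighted_sdr_loss(x: torch.Tensor, y: torch.Tensor,
                      y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Weighted-SDR loss on time-domain signals of shape (batch, samples)."""
    def neg_cos(a, b):
        # SDR-style term: negative inner product normalized by the norms
        return -torch.sum(a * b, dim=-1) / (
            torch.norm(a, dim=-1) * torch.norm(b, dim=-1) + eps)

    z, z_hat = x - y, x - y_hat            # true and estimated noise
    alpha = torch.sum(y ** 2, dim=-1) / (
        torch.sum(y ** 2, dim=-1) + torch.sum(z ** 2, dim=-1) + eps)
    return (alpha * neg_cos(y, y_hat) + (1 - alpha) * neg_cos(z, z_hat)).mean()
```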