Interactive Speech and Noise Modeling for Speech Enhancement

Authors: Chengyu Zheng, Xiulian Peng, Yuan Zhang, Sriram Srinivasan, Yan Lu (pp. 14549-14557)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on public datasets show that the interaction module plays a key role in simultaneous modeling and that SN-Net outperforms the state of the art by a large margin on various evaluation metrics. The proposed SN-Net also shows superior performance for speaker separation.
Researcher Affiliation | Collaboration | 1. Communication University of China; 2. Microsoft Research Asia; 3. Microsoft Corporation
Pseudocode | No | The paper describes network structures and processes in text and diagrams, but it does not include formal pseudocode or an algorithm block.
Open Source Code | No | The paper provides links to code for baseline methods (e.g., DTLN and Conv-TasNet) in footnotes, but does not state that the code for the proposed SN-Net is open source or provide a link to it.
Open Datasets | Yes | Three public datasets are used in the experiments: the DNS Challenge dataset (Reddy et al. 2020) from Interspeech 2020, which provides a large training set; Voice Bank + DEMAND, a small dataset created by Valentini-Botinhao et al. (2016); and the TIMIT corpus, used for the speaker separation experiment.
Dataset Splits | No | The paper specifies training and test sets but does not explicitly describe a separate validation split with specific percentages or counts.
Hardware Specification | No | The paper mentions that the algorithm is implemented in TensorFlow but provides no details about the hardware (e.g., GPU or CPU models, memory) used to run the experiments.
Software Dependencies | No | The proposed algorithm is implemented in TensorFlow, but no version numbers for TensorFlow or any other software libraries are provided.
Experiment Setup | Yes | The Adam optimizer is used with a learning rate of 0.0002, and all layers are initialized with Xavier initialization. Training is conducted in two stages: the speech and noise branches are jointly trained first with loss weights α = 1 and β = 0; the merge branch is then trained with the parameters of the previous two fixed, using only the loss L_Merge. Both stages are trained for 60 epochs on the DNS Challenge dataset and 400 epochs on Voice Bank + DEMAND. The batch size for all experiments is 32 unless otherwise specified. A hedged code sketch of this schedule follows the table.
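
For readers trying to reproduce the setup, the sketch below restates the two-stage schedule as a TensorFlow training loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the dense placeholder branches, the concatenation feeding the merge branch, and the MSE stand-ins for the paper's losses are all hypothetical, and the interaction modules that couple the branches in the real SN-Net are omitted. Only the Adam optimizer, learning rate 0.0002, Xavier initialization, loss weights α = 1 and β = 0, the freeze-then-train-merge schedule, and batch size 32 come from the paper's text.

```python
# Hedged sketch of the paper's two-stage training schedule in TensorFlow.
# Branch architectures and losses are simplified placeholders, not SN-Net.
import tensorflow as tf

ALPHA, BETA = 1.0, 0.0   # stage-1 loss weights (from the paper)
LR, BATCH = 2e-4, 32     # Adam learning rate and batch size (from the paper)
xavier = tf.keras.initializers.GlorotUniform()  # Xavier initialization

def dense_block(units=257):
    # Simplified placeholder; the real SN-Net branches use far richer blocks.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu",
                              kernel_initializer=xavier),
        tf.keras.layers.Dense(units, kernel_initializer=xavier),
    ])

speech_branch = dense_block()
noise_branch = dense_block()
merge_branch = dense_block()
opt1 = tf.keras.optimizers.Adam(learning_rate=LR)  # stage 1
opt2 = tf.keras.optimizers.Adam(learning_rate=LR)  # stage 2
mse = tf.keras.losses.MeanSquaredError()           # stand-in loss

@tf.function
def stage1_step(noisy, clean_speech, clean_noise):
    # Stage 1: jointly train the speech and noise branches with
    # loss = alpha * L_speech + beta * L_noise.
    with tf.GradientTape() as tape:
        loss = (ALPHA * mse(clean_speech, speech_branch(noisy))
                + BETA * mse(clean_noise, noise_branch(noisy)))
    train_vars = (speech_branch.trainable_variables
                  + noise_branch.trainable_variables)
    opt1.apply_gradients(zip(tape.gradient(loss, train_vars), train_vars))
    return loss

@tf.function
def stage2_step(noisy, clean_speech):
    # Stage 2: both branches frozen; train only the merge branch on a
    # placeholder for the paper's L_Merge.
    s_est = tf.stop_gradient(speech_branch(noisy))
    n_est = tf.stop_gradient(noise_branch(noisy))
    with tf.GradientTape() as tape:
        merged = merge_branch(tf.concat([s_est, n_est], axis=-1))
        loss = mse(clean_speech, merged)
    train_vars = merge_branch.trainable_variables
    opt2.apply_gradients(zip(tape.gradient(loss, train_vars), train_vars))
    return loss

# Toy check with random spectrogram-like frames (shapes are illustrative).
noisy = tf.random.normal([BATCH, 257])
clean_s = tf.random.normal([BATCH, 257])
stage1_step(noisy, clean_s, noisy - clean_s)
stage2_step(noisy, clean_s)
```

Note that with β = 0 the noise branch receives zero gradient in this simplified sketch; in the full SN-Net the interaction modules couple the two branches, so the noise branch still learns during joint training.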