Joint Time-Frequency and Time Domain Learning for Speech Enhancement

Authors: Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Wenxuan Xie, Wenjun Zeng

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments, 4.4 Ablation Study, 4.5 Comparison to State-of-the-Arts
Researcher Affiliation | Industry | Microsoft Research Asia; {chutan, cluo, zhiyzh, wenxie, wezeng}@microsoft.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | AVSpeech+Audio Set: Audios from the AVSpeech dataset are used as clean speech. It is a large dataset proposed by [Ephrat et al., 2018]. ... Audio Set [Gemmeke et al., 2017] ... Voice Bank+DEMAND: This is an open dataset proposed by [Valentini-Botinhao et al., 2016]. Speeches of 30 speakers selected from the Voice Bank corpus [Veaux et al., 2013] are used as clean speech ... with the DEMAND dataset [Thiemann et al., 2013].
Dataset Splits | Yes | In our experiments, 100k segments are randomly sampled from the AVSpeech dataset and the Balanced Train part of Audio Set are used to synthesize the training set, while the validation set is the same as the one used in [Ephrat et al., 2018], synthesized from the test part of the AVSpeech dataset and the evaluation part of Audio Set. Speeches of 30 speakers selected from the Voice Bank corpus [Veaux et al., 2013] are used as clean speech: 28 are included in the training set and 2 are in the validation set. ... Finally, the training and test sets contain 11572 and 824 noisy-clean speech pairs, respectively.
Hardware Specification | No | The paper mentions training models but does not provide specific details regarding the hardware used, such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or memory specifications.
Software Dependencies | No | The paper states 'Our method is implemented in Pytorch' but does not specify a version number for PyTorch or for any other software dependency.
Experiment Setup | Yes | Adam optimizer with a fixed learning rate of 0.0002 is used and the batch size is 8. Mean SDR and PESQ are used as the evaluation metrics on the test dataset. All audios are resampled to 16 kHz. STFT is computed using a Hann window of length 25 ms, hop length of 10 ms, and FFT size of 512, resulting in an input audio feature of 301 × 257 × 2 scalars. Convolution with zero padding, dilation = 1, and stride = 1 is used, ensuring the input and output features are the same size. The number of channels in each convolution layer is 96. ReLU activations follow all network layers except for the head layer (mask). Batch normalization [Ioffe and Szegedy, 2015] is performed after all convolutional layers.
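
For concreteness, below is a minimal PyTorch sketch of the reported front end and training configuration. The 3-second segment length is inferred from the reported 301 × 257 × 2 feature size (1 + 48000/160 = 301 centered STFT frames at 16 kHz); the kernel size, layer count, and all names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000                       # all audio resampled to 16 kHz
WIN_LENGTH = int(0.025 * SAMPLE_RATE)     # 25 ms Hann window -> 400 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)     # 10 ms hop -> 160 samples
N_FFT = 512                               # -> 512 // 2 + 1 = 257 frequency bins

def stft_features(wave: torch.Tensor) -> torch.Tensor:
    """Complex STFT packed as a (frames, freq_bins, 2) real tensor."""
    spec = torch.stft(
        wave,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        center=True,
        return_complex=True,
    )                                                  # (257, frames)
    return torch.view_as_real(spec).permute(1, 0, 2)   # (frames, 257, 2)

# A 3-second segment (inferred, not stated) reproduces the reported shape.
print(stft_features(torch.randn(3 * SAMPLE_RATE)).shape)  # torch.Size([301, 257, 2])

def conv_block(in_ch: int, kernel_size: int = 3) -> nn.Sequential:
    """One conv layer as described: 96 channels, stride 1, dilation 1,
    zero padding chosen so output size equals input size, then BN + ReLU.
    (Per the paper, the head/mask layer would omit the ReLU.)"""
    return nn.Sequential(
        nn.Conv2d(in_ch, 96, kernel_size, stride=1, dilation=1,
                  padding=kernel_size // 2),
        nn.BatchNorm2d(96),
        nn.ReLU(),
    )

model = nn.Sequential(conv_block(2), conv_block(96))       # illustrative stack
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # fixed LR 0.0002; batch size 8 via the DataLoader
```

With stride 1 and dilation 1, zero padding of kernel_size // 2 on each side preserves the spatial size for any odd kernel, which matches the paper's same-size constraint.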
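
The evaluation metrics can be sketched similarly. The paper names PESQ and mean SDR but no implementation, so the `pesq` PyPI package and the plain SDR definition below are assumptions, not the authors' tooling.

```python
import numpy as np
from pesq import pesq  # pip install pesq -- assumed implementation, not named in the paper

def sdr_db(ref: np.ndarray, est: np.ndarray) -> float:
    """Plain signal-to-distortion ratio in dB (no BSS-eval decomposition)."""
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

# Toy check: a clean sine versus a lightly corrupted copy.
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
enhanced = clean + 0.01 * np.random.randn(16000).astype(np.float32)
print(f"SDR: {sdr_db(clean, enhanced):.1f} dB")

# PESQ expects real speech sampled at 8 or 16 kHz; 'wb' selects wideband PESQ.
# score = pesq(16000, clean_speech, enhanced_speech, 'wb')
```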