UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Authors: Zhong-Qiu Wang, Shinji Watanabe
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR. (Sections 5, Experimental setup, and 6, Evaluation results) |
| Researcher Affiliation | Academia | Zhong-Qiu Wang and Shinji Watanabe Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA wang.zhongqiu41@gmail.com |
| Pseudocode | No | The paper describes the algorithm steps in text and equations but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | A sound demo is available at this link (footnote 6: https://zqwang7.github.io/demos/UNSSOR-demo/index.html). This links to a demo, not the source code itself. |
| Open Datasets | Yes | We validate the proposed algorithms on two-speaker separation in reverberant conditions based on the six-channel SMS-WSJ dataset [67]. and Appendix A: SMS-WSJ [67] is a popular corpus for evaluating two-speaker separation algorithms in reverberant conditions. The clean speech is sampled from the WSJ0 and WSJ1 datasets. The corpus contains 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) two-speaker mixtures for training, validation, and testing, respectively. |
| Dataset Splits | Yes | The corpus contains 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) two-speaker mixtures for training, validation, and testing, respectively. |
| Hardware Specification | Yes | For each model, an Nvidia A100 40GB GPU is used for training, and the model converges in three to four days. and We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the RTX 8000 GPUs used in this research. |
| Software Dependencies | Yes | Adam (with the default setup in PyTorch v1.9) is used as the optimizer. |
| Experiment Setup | Yes | By default, for STFT, the window size is 32 ms, the hop size is 8 ms, and the square-root Hann window is used as the analysis window. and Using the symbols defined in Table I of [23], we set its hyper-parameters to D = 48, B = 4, I = 4, J = 1, H = 192, L = 4 and E = 4 for 8 kHz sampling rate. and The learning rate starts from 10^-3 and is halved if the validation loss is not improved in two epochs. We terminate training once the learning rate is reduced to 6.25e-5. The batch size is set to four, with each segment being 4 seconds long. |
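The learning-rate schedule quoted above (start at 10^-3, halve on a two-epoch validation plateau, stop at 6.25e-5) can be sketched as plain Python. This is a minimal sketch, not the authors' code: it assumes "not improved in two epochs" behaves like a patience-2 ReduceLROnPlateau rule, and all names (`step_scheduler`, `PATIENCE`, etc.) are illustrative.

```python
# Hypothetical reconstruction of the schedule described in the paper;
# constants are taken from the quoted experiment setup.
START_LR = 1e-3      # initial learning rate
FLOOR_LR = 6.25e-5   # training terminates once the lr is reduced to this
PATIENCE = 2         # epochs without validation improvement before halving

def step_scheduler(lr, best_loss, val_loss, bad_epochs):
    """Return (new_lr, new_best_loss, new_bad_epochs, stop_training)."""
    if val_loss < best_loss:
        return lr, val_loss, 0, False      # improved: reset patience counter
    bad_epochs += 1
    if bad_epochs >= PATIENCE:
        lr /= 2                            # halve after two stale epochs
        bad_epochs = 0
    return lr, best_loss, bad_epochs, lr <= FLOOR_LR

# Example: simulate a run whose validation loss plateaus immediately at 1.0.
lr, best, bad, stop = START_LR, float("inf"), 0, False
history = [lr]
while not stop:
    lr, best, bad, stop = step_scheduler(lr, best, 1.0, bad)
    if history[-1] != lr:
        history.append(lr)

print(history)  # four halvings: 1e-3 -> 5e-4 -> 2.5e-4 -> 1.25e-4 -> 6.25e-5
```

In PyTorch (which the paper reports using), the same behavior would typically be obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)` plus an external stopping check against the 6.25e-5 floor.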