UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
Authors: Zhong-Qiu Wang, Shinji Watanabe
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR. (Sections 5, Experimental setup, and 6, Evaluation results) |
| Researcher Affiliation | Academia | Zhong-Qiu Wang and Shinji Watanabe Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA wang.zhongqiu41@gmail.com |
| Pseudocode | No | The paper describes the algorithm steps in text and equations but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | No | A sound demo is available at this link (footnote 6: https://zqwang7.github.io/demos/UNSSOR-demo/index.html). This links to a demo, not the source code itself. |
| Open Datasets | Yes | We validate the proposed algorithms on two-speaker separation in reverberant conditions based on the six-channel SMS-WSJ dataset [67]. and Appendix A: SMS-WSJ [67] is a popular corpus for evaluating two-speaker separation algorithms in reverberant conditions. The clean speech is sampled from the WSJ0 and WSJ1 datasets. The corpus contains 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) two-speaker mixtures for training, validation, and testing, respectively. |
| Dataset Splits | Yes | The corpus contains 33,561 (~87.4 h), 982 (~2.5 h), and 1,332 (~3.4 h) two-speaker mixtures for training, validation, and testing, respectively. |
| Hardware Specification | Yes | For each model, an Nvidia A100 40GB GPU is used for training, and the model converges in three to four days. and We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the RTX 8000 GPUs used in this research. |
| Software Dependencies | Yes | Adam (with the default setup in PyTorch v1.9) is used as the optimizer. |
| Experiment Setup | Yes | By default, for STFT, the window size is 32 ms, the hop size is 8 ms, and the square-root Hann window is used as the analysis window. and Using the symbols defined in Table I of [23], we set its hyper-parameters to D = 48, B = 4, I = 4, J = 1, H = 192, L = 4 and E = 4 for 8 kHz sampling rate. and The learning rate starts from 10^-3 and is halved if the validation loss is not improved in two epochs. We terminate training once the learning rate is reduced to 6.25e-5. The batch size is set to four, with each segment being 4 seconds long. |
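The learning-rate schedule quoted above (start at 10^-3, halve on a two-epoch validation plateau, stop at 6.25e-5) can be sketched as plain Python. This is a minimal sketch, not the authors' code: it assumes "not improved in two epochs" behaves like a patience-2 ReduceLROnPlateau rule, and all names (`step_scheduler`, `PATIENCE`, etc.) are illustrative.

```python
# Hypothetical reconstruction of the schedule described in the paper;
# constants are taken from the quoted experiment setup.
START_LR = 1e-3      # initial learning rate
FLOOR_LR = 6.25e-5   # training terminates once the lr is reduced to this
PATIENCE = 2         # epochs without validation improvement before halving

def step_scheduler(lr, best_loss, val_loss, bad_epochs):
    """Return (new_lr, new_best_loss, new_bad_epochs, stop_training)."""
    if val_loss < best_loss:
        return lr, val_loss, 0, False      # improved: reset patience counter
    bad_epochs += 1
    if bad_epochs >= PATIENCE:
        lr /= 2                            # halve after two stale epochs
        bad_epochs = 0
    return lr, best_loss, bad_epochs, lr <= FLOOR_LR

# Example: simulate a run whose validation loss plateaus immediately at 1.0.
lr, best, bad, stop = START_LR, float("inf"), 0, False
history = [lr]
while not stop:
    lr, best, bad, stop = step_scheduler(lr, best, 1.0, bad)
    if history[-1] != lr:
        history.append(lr)

print(history)  # four halvings: 1e-3 -> 5e-4 -> 2.5e-4 -> 1.25e-4 -> 6.25e-5
```

In PyTorch (which the paper reports using), the same behavior would typically be obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)` plus an external stopping check against the 6.25e-5 floor.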