Voice Separation with an Unknown Number of Multiple Speakers

Authors: Eliya Nachmani, Yossi Adi, Lior Wolf

ICML 2020

Reproducibility assessment. Each variable below is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM Response: "3. Experiments: For research purposes only, we use the WSJ0-2mix and WSJ0-3mix datasets (Hershey et al., 2016) and we further expand the WSJ-mix dataset to four and five speakers and introduce WSJ0-4mix and WSJ0-5mix datasets. All of the aforementioned datasets are based on the WSJ0 corpus (Garofolo et al., 1993). We compare with the following baseline methods: ADANet (Luo et al., 2018), DPCL++ (Isik et al., 2016), CBLDNN-GAT (Li et al., 2018), TasNet (Luo & Mesgarani, 2018), the Ideal Ratio Mask (IRM), the Ideal Binary Mask (IBM), Conv-TasNet (Luo & Mesgarani, 2019), FurcaNeXt (Zhang et al., 2020), and DPRNN (Luo et al., 2019c)."

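For context, the WSJ0-Nmix recipes construct each mixture by gain-scaling and summing single-speaker WSJ0 utterances. Below is a minimal sketch of that mixing step in Python, assuming WAV I/O via the soundfile package; the gain range, truncation policy, and helper name are illustrative assumptions, not the authors' released scripts.

```python
import numpy as np
import soundfile as sf  # assumed dependency for WAV I/O

def mix_utterances(paths, gain_db_range=(-2.5, 2.5), seed=None):
    """Sum several single-speaker WAV files into one mixture.

    Per-speaker gains are drawn uniformly in dB (the exact range used
    for the 4- and 5-speaker sets is an assumption here); utterances
    are truncated to the shortest one, as in the "min" mixing mode.
    """
    rng = np.random.default_rng(seed)
    sources = [sf.read(p)[0] for p in paths]
    n = min(len(s) for s in sources)
    gains = 10.0 ** (rng.uniform(*gain_db_range, size=len(sources)) / 20.0)
    scaled = [g * s[:n] for g, s in zip(gains, sources)]
    mixture = np.sum(scaled, axis=0)
    peak = np.abs(mixture).max()
    if peak > 1.0:  # renormalize jointly to avoid clipping after summation
        scaled = [s / peak for s in scaled]
        mixture = mixture / peak
    return mixture, scaled  # the mixture plus the per-speaker targets
```
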
Researcher Affiliation: Collaboration (Facebook AI Research and Tel-Aviv University)
LLM Response: "Eliya Nachmani 1 2, Yossi Adi 1, Lior Wolf 1 2. 1 Facebook AI Research, 2 Tel-Aviv University. Correspondence to: Eliya Nachmani <enk100@gmail.com>, Yossi Adi <yossiadidrum@gmail.com>, Lior Wolf <liorwolf@gmail.com>."

Pseudocode: No
LLM Response: The paper provides architectural diagrams (Figures 1, 2, and 3) and descriptive text about the model and training objective, but it does not include any formal pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: "For research purposes only, WSJ0-4mix and WSJ0-5mix dataset creation scripts are available as supplementary. A separate model is trained for each dataset, with the corresponding number of output channels. Sample results and db creation scripts can be found under the following link: https://enk100.github.io/speaker_separation"

Open Datasets: Yes
LLM Response: "For research purposes only, we use the WSJ0-2mix and WSJ0-3mix datasets (Hershey et al., 2016) and we further expand the WSJ-mix dataset to four and five speakers and introduce WSJ0-4mix and WSJ0-5mix datasets. All of the aforementioned datasets are based on the WSJ0 corpus (Garofolo et al., 1993). We evaluate the proposed method on the MusDB dataset (Rafii et al., 2017)."

Dataset Splits: Yes
LLM Response: "We use the same procedure as in (Hershey et al., 2016), i.e., we use 30 hours of speech from the training set si_tr_s to create the training and validation sets. We use the first 84 songs for the training set and the next 16 songs for the validation set (we follow the same split as defined in the musdb python package), while the remaining 50 songs are used for the test set."

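The quoted MusDB numbers match the default train/valid partition shipped with the musdb python package; below is a minimal loading sketch, assuming that package's musdb.DB interface and a placeholder dataset root.

```python
import musdb  # pip install musdb

ROOT = "path/to/musdb"  # placeholder: local MusDB root directory

# The package's built-in split yields 84 training and 16 validation songs;
# the remaining 50 songs form the separate "test" subset.
train = musdb.DB(root=ROOT, subsets="train", split="train")
valid = musdb.DB(root=ROOT, subsets="train", split="valid")
test = musdb.DB(root=ROOT, subsets="test")

print(len(train), len(valid), len(test))  # expected: 84 16 50
```
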
Hardware Specification: No
LLM Response: The paper describes the model architecture (e.g., "each LSTM layer contains 128 neurons") and training parameters, but does not specify any particular hardware components (e.g., GPU models, CPU types) used for the experiments.

Software Dependencies: No
LLM Response: The paper mentions using "the ADAM optimizer (Kingma & Ba, 2014)" and the "Demucs package (Défossez et al., 2019)" but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, or CUDA.

Experiment Setup: Yes
LLM Response: "Implementation details: We choose hyper-parameters based on the validation set. The input kernel size L was 8 (except for the experiment where we vary it), and the number of filters in the preliminary convolutional layer was 128. We use audio segments four seconds long, sampled at 8kHz. The architecture uses b = 6 blocks of MULCAT, where each LSTM layer contains 128 neurons. We multiply the ID loss by 0.001 when combining it with the uPIT loss. The learning rate was set to 5e-4 and was multiplied by 0.98 every two epochs. The ADAM optimizer (Kingma & Ba, 2014) was used with a batch size of 2. For the speaker model, we extract the STFT using a window size of 20ms with a stride of 10ms and a Hamming window."
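
Taken together, the quoted hyper-parameters map onto a short training configuration. The sketch below assumes PyTorch; the stand-in module, the loss placeholders, and the function names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Stand-in for the separation network (the paper uses b = 6 MULCAT blocks);
# only the preliminary encoder is mirrored: 128 filters, kernel size L = 8.
model = nn.Conv1d(1, 128, kernel_size=8, stride=4)

# ADAM with lr = 5e-4, batch size 2; the lr is multiplied by 0.98 every
# two epochs (assuming scheduler.step() is called once per epoch).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.98)

def combined_loss(upit_loss, id_loss):
    # The identity (ID) loss is weighted by 0.001 when added to the uPIT loss.
    return upit_loss + 0.001 * id_loss

def speaker_model_stft(wav, sr=8000):
    # Speaker-model features: STFT with a 20 ms Hamming window and 10 ms stride.
    win = int(0.020 * sr)  # 160 samples at 8 kHz
    hop = int(0.010 * sr)  # 80 samples
    return torch.stft(wav, n_fft=win, hop_length=hop, win_length=win,
                      window=torch.hamming_window(win), return_complex=True)
```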