Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation

Authors: Yucheng Zhao, Chong Luo, Zheng-Jun Zha, Wenjun Zeng

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data... We train all models for 1M steps... Table 2 shows speech separation results on WSJ0-2mix."
Researcher Affiliation | Collaboration | "Yucheng Zhao¹, Chong Luo², Zheng-Jun Zha¹ and Wenjun Zeng²; ¹University of Science and Technology of China, ²Microsoft Research Asia"
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | The paper does not provide a link or other concrete access to source code for the described method.
Open Datasets | Yes | "We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. ... The second is the recently proposed WHAM! dataset [Wichern et al., 2019], which is an extension of the original WSJ0-2mix."
Dataset Splits | Yes | "WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data."
Hardware Specification | Yes | "The numbers are acquired on a single P100 GPU."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, but does not give version numbers for any software dependencies (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | "We train all models for 1M steps on 4-second segments with a sample rate of 8 kHz. We use the Adam optimizer with warmup. The learning rate is initialized to 0.0003 and is adjusted according to the following formula: lr = init_lr · min(step^(-0.3), step · warmup_steps^(-1.3))  (7). We choose warmup_steps = 10000. We also use dropout to relieve over-fitting. In all experiments, we use a frame length of 2 ms. We choose a group size of 1000 for noise-free speech separation and a group size of 500 for noisy speech separation. Following the notations in [Vaswani et al., 2017], the Transformer parameters are d_ff = 1024, d_model = 512, and h = 8. In the light fusion, we use 8 layers of transformation for the GT in the smallest scale and 2 layers for the GT in the other scales. In the dense fusion, we use 3 layers of transformation for the GT in all scales. The output feature dimension of the encoder is 1024."
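As a reading aid for the Experiment Setup row, here is a minimal sketch of the warmup schedule in Eq. (7), using the published hyper-parameters (init_lr = 0.0003, warmup_steps = 10000). The PyTorch wiring, the `warmup_lr` helper, and the stand-in `Linear` module are illustrative assumptions; the paper does not release code or name a framework.

```python
import torch

def warmup_lr(step, init_lr=3e-4, warmup_steps=10_000):
    """Eq. (7): lr = init_lr * min(step^-0.3, step * warmup_steps^-1.3)."""
    step = max(step, 1)  # LambdaLR starts counting at 0; avoid 0 ** -0.3
    return init_lr * min(step ** -0.3, step * warmup_steps ** -1.3)

# Hypothetical wiring into a PyTorch training loop; this Linear layer is a
# stand-in, not the paper's multi-scale group Transformer.
model = torch.nn.Linear(512, 512)
# Base lr of 1.0 so the lambda below returns the absolute rate, not a multiplier.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lr)

for step in range(5):
    optimizer.step()   # one (dummy) optimization step
    scheduler.step()   # advance the warmup/decay schedule
```

The two branches of the min intersect at step = warmup_steps: the rate grows roughly linearly during warmup and then decays as step^(-0.3), mirroring the inverse-square-root schedule of [Vaswani et al., 2017] with different exponents.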