Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation
Authors: Yucheng Zhao, Chong Luo, Zheng-Jun Zha, Wenjun Zeng
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data... We train all models for 1M steps... Table 2 shows speech separation results on WSJ0-2mix. |
| Researcher Affiliation | Collaboration | Yucheng Zhao¹, Chong Luo², Zheng-Jun Zha¹ and Wenjun Zeng²; ¹University of Science and Technology of China, ²Microsoft Research Asia |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. ... The second is the recently proposed WHAM! dataset [Wichern et al., 2019], which is an extension of the original WSJ0-2mix. |
| Dataset Splits | Yes | WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data. |
| Hardware Specification | Yes | The numbers are acquired on a single P100 GPU. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train all models for 1M steps on 4-second segments with sample rate of 8 kHz. We use Adam optimizer with warmup. The learning rate is initialized to 0.0003 and is adjusted according to the following formula: lr = init_lr · min(step⁻⁰·³, step · warmup_steps⁻¹·³) (Eq. 7). We choose warmup_steps = 10000. We also use dropout to relieve over-fitting. In all experiments, we use frame length of 2 ms. We choose group size of 1000 for noise-free speech separation and group size of 500 for noisy speech separation. Following the notations in [Vaswani et al., 2017], the Transformer parameters are d_ff = 1024, d_model = 512, and h = 8. In the light fusion, we use 8 layers of transformation for the GT in the smallest scale and 2 layers for the GT in the other scales. In the dense fusion, we use 3 layers of transformation for GT in all the scales. The output feature dimension of the encoder is 1024. |
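The warmup schedule quoted in the Experiment Setup row (Eq. 7) can be sketched as a small function. This is a minimal illustration based on the constants stated in the paper (init_lr = 0.0003, warmup_steps = 10000); the function name and the clamping of step to at least 1 are our own assumptions, not from the paper.

```python
def lr_schedule(step, init_lr=3e-4, warmup_steps=10_000):
    """Learning-rate warmup per Eq. (7) of the paper:

        lr = init_lr * min(step**-0.3, step * warmup_steps**-1.3)

    The lr rises linearly for the first `warmup_steps` steps, peaks at
    step == warmup_steps (both min() arguments are equal there), then
    decays as step**-0.3.
    """
    step = max(step, 1)  # assumption: guard against step=0 (0**-0.3 is undefined)
    return init_lr * min(step ** -0.3, step * warmup_steps ** -1.3)
```

At step = warmup_steps the two branches coincide at init_lr · warmup_steps⁻⁰·³, which is the peak learning rate of the schedule.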