Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation

Authors: Yucheng Zhao, Chong Luo, Zheng-Jun Zha, Wenjun Zeng

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data... We train all models for 1M steps... Table 2 shows speech separation results on WSJ0-2mix."
Researcher Affiliation | Collaboration | "Yucheng Zhao¹, Chong Luo², Zheng-Jun Zha¹ and Wenjun Zeng²; ¹University of Science and Technology of China, ²Microsoft Research Asia"
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | The paper does not provide a link or other concrete access to source code for the described method.
Open Datasets | Yes | "We use two datasets for evaluation. The first is the widely-used WSJ0-2mix dataset [Hershey et al., 2016]. ... The second is the recently proposed WHAM! dataset [Wichern et al., 2019], which is an extension of the original WSJ0-2mix."
Dataset Splits | Yes | "WSJ0-2mix contains 30 hours of training data, 10 hours of validation data and 5 hours of testing data."
Hardware Specification | Yes | "The numbers are acquired on a single P100 GPU."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, but does not give version numbers for any software dependencies (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | "We train all models for 1M steps on 4-second segments with a sample rate of 8 kHz. We use the Adam optimizer with warmup. The learning rate is initialized to 0.0003 and is adjusted according to the following formula: lr = init_lr · min(step^(-0.3), step · warmup_steps^(-1.3))  (7). We choose warmup_steps = 10000. We also use dropout to relieve over-fitting. In all experiments, we use a frame length of 2 ms. We choose a group size of 1000 for noise-free speech separation and a group size of 500 for noisy speech separation. Following the notations in [Vaswani et al., 2017], the Transformer parameters are d_ff = 1024, d_model = 512, and h = 8. In the light fusion, we use 8 layers of transformation for the GT in the smallest scale and 2 layers for the GT in the other scales. In the dense fusion, we use 3 layers of transformation for the GT in all scales. The output feature dimension of the encoder is 1024."
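As a reading aid for the Experiment Setup row, here is a minimal sketch of the warmup schedule in Eq. (7), using the published hyper-parameters (init_lr = 0.0003, warmup_steps = 10000). The PyTorch wiring, the `warmup_lr` helper, and the stand-in `Linear` module are illustrative assumptions; the paper does not release code or name a framework.

```python
import torch

def warmup_lr(step, init_lr=3e-4, warmup_steps=10_000):
    """Eq. (7): lr = init_lr * min(step^-0.3, step * warmup_steps^-1.3)."""
    step = max(step, 1)  # LambdaLR starts counting at 0; avoid 0 ** -0.3
    return init_lr * min(step ** -0.3, step * warmup_steps ** -1.3)

# Hypothetical wiring into a PyTorch training loop; this Linear layer is a
# stand-in, not the paper's multi-scale group Transformer.
model = torch.nn.Linear(512, 512)
# Base lr of 1.0 so the lambda below returns the absolute rate, not a multiplier.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lr)

for step in range(5):
    optimizer.step()   # one (dummy) optimization step
    scheduler.step()   # advance the warmup/decay schedule
```

The two branches of the min intersect at step = warmup_steps: the rate grows roughly linearly during warmup and then decays as step^(-0.3), mirroring the inverse-square-root schedule of [Vaswani et al., 2017] with different exponents.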