An efficient encoder-decoder architecture with top-down attention for speech separation

Authors: Kai Li, Runxuan Yang, Xiaolin Hu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On three benchmark datasets, TDANet consistently achieved separation performance comparable to previous state-of-the-art (SOTA) methods with higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5% of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10% of Sepformer's. In addition, a large-size version of TDANet obtained SOTA results on three datasets, with MACs still only 10% of Sepformer and CPU inference time only 24% of Sepformer's.
Researcher Affiliation | Academia | Kai Li (1), Runxuan Yang (1) & Xiaolin Hu (1,2,3). 1. Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China; 2. Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; 3. Chinese Institute for Brain Research (CIBR), Beijing 100010, China
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found.
Open Source Code | Yes | "The PyTorch implementation of our method is publicly available. This project is under the MIT license." https://cslikai.cn/project/TDANet/
Open Datasets | Yes | "We evaluated TDANet and other existing methods on three datasets: Libri2Mix (Cosentino et al., 2020), and WHAM! (Wichern et al., 2019)... We created a new speech separation dataset, LRS2-2Mix, using the LRS2 corpus... It is publicly available." https://drive.google.com/file/d/1dCWD5OIGcj43qTidmU18unoaqo_6QetW/view
Dataset Splits | Yes | "The speaker identities of training/validation set and test set are non-intersecting... The training set, validation set and test set contain 20000, 5000 and 3000 utterances, respectively."
Hardware Specification | Yes | "For all experiments, we used 8 GeForce RTX 3080 for training and testing."
Software Dependencies | No | The paper mentions "PyTorch implementation" and "PyTorch-OpCounter" but does not specify their version numbers or the versions of other software dependencies.
Experiment Setup | Yes | "We set the kernel size L of the audio encoder and audio decoder in the overall pipeline to 4 ms and stride size L/4 to 1 ms. The number of down-sampling S was set to 4. The number of channels N of all convolutional layers in each layer was set to 512... we set the number of channels for the three convolutional layers in the FFN layer to (512, 1024, 512), the kernel sizes to (1, 5, 1), the stride sizes to (1, 1, 1), and the bias settings to (False, True, False). To avoid overfitting, we set the probability of all dropouts to 0.1. We trained all models for 500 epochs. The batch size was set to 1 at the utterance level. Our proposed model used the Adam (Kingma & Ba, 2015) optimizer with an initial learning rate of 0.001."
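The quoted setup gives the encoder kernel and stride in milliseconds, so the sample-level window and hop depend on the audio sampling rate, which this excerpt does not state. A minimal sketch of that conversion, assuming 16 kHz audio (a common rate for these benchmarks; the rate is an assumption here, not taken from the excerpt):

```python
# Convert the reported encoder kernel/stride from milliseconds to samples
# and count the resulting encoder frames for a given utterance length.
# Assumption: 16 kHz sampling rate; the review excerpt gives only the
# millisecond values (kernel L = 4 ms, stride L/4 = 1 ms).

def ms_to_samples(ms: float, sample_rate: int = 16000) -> int:
    """Duration in milliseconds -> number of audio samples."""
    return int(sample_rate * ms / 1000)

def num_frames(num_samples: int, win: int, hop: int) -> int:
    """Frames produced by sliding a window of `win` samples with hop `hop`
    over `num_samples` samples, without padding."""
    return (num_samples - win) // hop + 1

win = ms_to_samples(4)  # 4 ms kernel -> 64 samples at 16 kHz
hop = ms_to_samples(1)  # 1 ms stride -> 16 samples (win / 4)

print(win, hop, num_frames(16000, win, hop))  # 1 second of audio -> 64 16 997
```

At 8 kHz the same formulas would give a 32-sample kernel and 8-sample hop; only the millisecond values are fixed by the paper.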