An efficient encoder-decoder architecture with top-down attention for speech separation
Authors: Kai Li, Runxuan Yang, Xiaolin Hu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods at higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5% of Sepformer's, one of the previous SOTA models, and its CPU inference time is only 10% of Sepformer's. In addition, a large-size version of TDANet obtained SOTA results on all three datasets, with MACs still only 10% of Sepformer's and CPU inference time only 24% of Sepformer's. |
| Researcher Affiliation | Academia | Kai Li (1), Runxuan Yang (1) & Xiaolin Hu (1,2,3); 1. Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China; 2. Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; 3. Chinese Institute for Brain Research (CIBR), Beijing 100010, China |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. |
| Open Source Code | Yes | The PyTorch implementation of our method is publicly available. This project is under the MIT license. https://cslikai.cn/project/TDANet/ |
| Open Datasets | Yes | We evaluated TDANet and other existing methods on three datasets: Libri2Mix (Cosentino et al., 2020) and WHAM! (Wichern et al., 2019)... We created a new speech separation dataset, LRS2-2Mix, using the LRS2 corpus... It is publicly available: https://drive.google.com/file/d/1dCWD5OIGcj43qTidmU18unoaqo_6QetW/view |
| Dataset Splits | Yes | The speaker identities of training/validation set and test set are non-intersecting... The training set, validation set and test set contain 20000, 5000 and 3000 utterances, respectively. |
| Hardware Specification | Yes | For all experiments, we used 8 GeForce RTX 3080 GPUs for training and testing. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' and the PyTorch-OpCounter tool but does not specify their version numbers or the versions of other software dependencies. (A hedged MAC-counting sketch using this tool appears after the table.) |
| Experiment Setup | Yes | We set the kernel size L of the audio encoder and audio decoder in the overall pipeline to 4 ms and the stride size L/4 to 1 ms. The number of down-sampling layers S was set to 4. The number of channels N of all convolutional layers was set to 512... we set the number of channels for the three convolutional layers in the FFN layer to (512, 1024, 512), the kernel sizes to (1, 5, 1), the stride sizes to (1, 1, 1), and the bias settings to (False, True, False). To avoid overfitting, we set the probability of all dropouts to 0.1. We trained all models for 500 epochs. The batch size was set to 1 at the utterance level. Our proposed model used the Adam (Kingma & Ba, 2015) optimizer with an initial learning rate of 0.001. (A hedged configuration sketch follows the table.) |
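
Since the paper reports MACs and CPU inference time but not how they were collected, here is a minimal sketch of how such numbers might be measured with PyTorch-OpCounter (the `thop` package), the tool the paper cites. The toy model, the 16 kHz sample rate, and the one-second input are illustrative assumptions, not the paper's actual measurement protocol.

```python
import time

import torch
import torch.nn as nn
from thop import profile  # PyTorch-OpCounter

# Toy stand-in model; TDANet itself is published at
# https://cslikai.cn/project/TDANet/ and is not reproduced here.
model = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=64, stride=16),   # ~4 ms encoder kernel at an assumed 16 kHz
    nn.ReLU(),
    nn.ConvTranspose1d(512, 1, kernel_size=64, stride=16),
)

# One second of single-channel audio: (batch, channels, samples).
dummy = torch.randn(1, 1, 16000)

# profile() returns total MACs and parameter count for one forward pass.
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.3f} G | Params: {params / 1e6:.3f} M")

# Rough CPU inference timing, analogous to the paper's CPU latency comparison.
model.eval()
with torch.no_grad():
    start = time.perf_counter()
    model(dummy)
    print(f"CPU inference: {(time.perf_counter() - start) * 1e3:.1f} ms")
```

In practice, one would average the timing over many runs after a warm-up pass; a single measurement is shown only to keep the sketch short.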
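
The experiment-setup row above gives enough numbers to sketch the encoder and FFN configuration in PyTorch. This is a hedged reading, not the official code: the 16 kHz sample rate (making the 4 ms kernel 64 samples and the 1 ms stride 16 samples), the padding, and the placement of activations and dropout inside the FFN stack are all assumptions; only the quoted numbers come from the paper.

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 16000                  # assumption; the paper gives sizes in ms, not samples
KERNEL = int(0.004 * SAMPLE_RATE)    # kernel L: 4 ms -> 64 samples
STRIDE = KERNEL // 4                 # stride L/4: 1 ms -> 16 samples
N_CHANNELS = 512                     # channel count N from the quote
DROPOUT = 0.1                        # "probability of all dropouts"

# Audio encoder: 1-D conv over the raw waveform with the stated kernel/stride.
encoder = nn.Conv1d(1, N_CHANNELS, kernel_size=KERNEL, stride=STRIDE)

# One plausible reading of the FFN spec: channels (512, 1024, 512),
# kernels (1, 5, 1), strides (1, 1, 1), biases (False, True, False).
ffn = nn.Sequential(
    nn.Conv1d(512, 1024, kernel_size=1, stride=1, bias=False),
    nn.ReLU(),
    nn.Dropout(DROPOUT),
    nn.Conv1d(1024, 1024, kernel_size=5, stride=1, padding=2, bias=True),
    nn.ReLU(),
    nn.Dropout(DROPOUT),
    nn.Conv1d(1024, 512, kernel_size=1, stride=1, bias=False),
)

# Optimizer as reported: Adam with an initial learning rate of 0.001.
params = list(encoder.parameters()) + list(ffn.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

# Shape check on one second of audio.
x = torch.randn(1, 1, SAMPLE_RATE)
print(ffn(encoder(x)).shape)         # -> torch.Size([1, 512, 997])
```

The `padding=2` on the kernel-5 layer simply preserves sequence length; whether the original FFN does the same (or uses a depthwise middle conv) is not stated in the quoted setup, so treat this block as a configuration illustration only.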