Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation
Authors: Mohan Xu, Kai Li, Guo Chen, Xiaolin Hu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results showed that models trained on EchoSet had better generalization ability than those trained on other datasets to the data collected in the physical world, which validated the practical value of EchoSet. On EchoSet and real-world data, TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing state-of-the-art (SOTA) model TF-GridNet. |
| Researcher Affiliation | Academia | 1. Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China 2. Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China 3. Chinese Institute for Brain Research (CIBR), Beijing 100010, China |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It describes methodologies through architectural diagrams and mathematical formulations. |
| Open Source Code | Yes | Code is available at: https://github.com/JusperLee/TIGER. |
| Open Datasets | Yes | Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. [...] The dataset is available at: https://huggingface.co/datasets/JusperLee/EchoSet. [...] For fair comparison with previous speech separation methods (Li et al., 2023; Wang et al., 2023; Hu et al., 2021), we also used two benchmark datasets LRS2-2Mix (Li et al., 2023) and Libri2Mix train-100 min (Cosentino et al., 2020). |
| Dataset Splits | Yes | In total, EchoSet includes 20,268 training utterances, 4,604 validation utterances, and 2,650 test utterances. Each utterance lasts for 6 seconds. [...] LRS2-2Mix (Li et al., 2023). Each audio in this dataset lasts for 2 seconds, at the sampling rate of 16 kHz. The training set, validation set and test set are about 11.1, 2.8 and 1.7 hours, respectively. [...] During training, we utilized 3-second audio segments for EchoSet and Libri2Mix, and 2-second segments for LRS2-2Mix. |
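The reported EchoSet split sizes can be converted to total audio duration as a quick sanity check (a small sketch; the hour figures are derived here, not quoted from the paper):

```python
# Each EchoSet utterance lasts 6 seconds; convert split sizes to hours.
UTTERANCE_SEC = 6
splits = {"train": 20268, "valid": 4604, "test": 2650}
hours = {name: n * UTTERANCE_SEC / 3600 for name, n in splits.items()}
# train ≈ 33.8 h, valid ≈ 7.7 h, test ≈ 4.4 h
```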
| Hardware Specification | Yes | Inference speed was measured on NVIDIA RTX 4090 and Intel Xeon Gold 6326. [...] only used a single card when calculating GPU (GeForce RTX 4090) time. [...] number of threads to 1 when calculating CPU (Intel(R) Xeon(R) Gold 6326) time |
| Software Dependencies | Yes | We used ptflops 0.7.32 to calculate parameters and MACs. |
| Experiment Setup | Yes | In the encoder and decoder, the window and hop size of STFT and iSTFT were set to 640 (40 ms) and 160 (10 ms). We use the Hanning window to mitigate spectrum leakage. [...] We adopt the band-split scheme Low Freq Narrow Split in Table 10. The number of total sub-bands K was 67. For each sub-band, the bandwidth was uniformly transformed into N = 128. In the separator, the FFI blocks which share parameters were repeated B = 4 times for the small version and B = 8 times for the large version. Each MSA module's features were downsampled D = 4 times, and the hidden layer dimension H was set to 256. For the F3A module, the number of attention heads was set to 4. When calculating the query and key in each head of the F3A module, the hidden channel E was set to 4. During training, we used 3-second audio segments for EchoSet and Libri2Mix, and 2-second segments for LRS2-2Mix. We used the maximization of SI-SDR as the training loss (Le Roux et al., 2019). The maximum training round was 500. We used Adam as the optimizer (Kingma & Ba, 2014), with the initial learning rate set to 0.001. If the loss on the validation set did not decrease further within 10 consecutive rounds, the learning rate was halved. When the performance on the validation set did not improve further within 20 consecutive rounds, the training was stopped. |
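The training objective above maximizes SI-SDR (Le Roux et al., 2019). A minimal pure-Python sketch of that metric, written here for illustration (the authors' actual implementation is not shown in this report):

```python
import math

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB (Le Roux et al., 2019).

    s_target = (<est, ref> / ||ref||^2) * ref
    e_noise  = est - s_target
    SI-SDR   = 10 * log10(||s_target||^2 / ||e_noise||^2)
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference) + eps
    scale = dot / ref_energy
    s_target = [scale * r for r in reference]
    e_noise = [e - t for e, t in zip(estimate, s_target)]
    target_energy = sum(t * t for t in s_target)
    noise_energy = sum(n * n for n in e_noise) + eps
    return 10 * math.log10(target_energy / noise_energy + eps)

def neg_si_sdr_loss(reference, estimate):
    # Training maximizes SI-SDR, i.e. minimizes its negative.
    return -si_sdr(reference, estimate)
```

Because the target is rescaled by the projection coefficient, any nonzero rescaling of a perfect estimate still yields a very high SI-SDR, which is the "scale-invariant" property the loss is named for.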
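The learning-rate schedule quoted above (halve after 10 non-improving validation rounds, stop after 20) can be sketched as a small plateau tracker. This is an assumed reading of the paper's description, not the authors' code; the class name and reset behavior after halving are choices made here:

```python
class PlateauScheduler:
    """Halve the LR if validation loss has not improved for `halve_patience`
    epochs; signal early stopping after `stop_patience` epochs without
    improvement (assumed semantics of the paper's schedule)."""

    def __init__(self, lr=1e-3, halve_patience=10, stop_patience=20):
        self.lr = lr
        self.halve_patience = halve_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.halve_wait = 0
        self.stop_wait = 0

    def step(self, val_loss):
        """Call once per epoch; returns False when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.halve_wait = 0
            self.stop_wait = 0
        else:
            self.halve_wait += 1
            self.stop_wait += 1
            if self.halve_wait >= self.halve_patience:
                self.lr *= 0.5
                self.halve_wait = 0
            if self.stop_wait >= self.stop_patience:
                return False
        return True
```

This mirrors the behavior of PyTorch's `ReduceLROnPlateau` (factor 0.5, patience 10) combined with a separate early-stopping counter at patience 20.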