SFSRNet: Super-resolution for Single-Channel Audio Source Separation

Authors: Joel Rixen, Matthias Renz (pp. 11220-11228)

AAAI 2022

Reproducibility variables, with results and the supporting LLM responses:

Research Type: Experimental
Based on the WSJ0-2mix benchmark where estimations of the audio signal of two speakers need to be extracted from the mixture, in our experiments our proposed SFSRNet reaches a scale-invariant signal-to-noise-ratio improvement (SI-SNRi) of 24.0 dB, outperforming the state-of-the-art solution SepFormer, which reaches an SI-SNRi of 22.3 dB.

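The headline comparison rests on the SI-SNRi metric, so it is worth pinning down. Below is a minimal NumPy sketch of SI-SNR and its improvement over the unprocessed mixture; the zero-meaning step and the `eps` guard are conventional choices on our part, not details quoted from the paper.

```python
import numpy as np

def si_snr(estimate, source, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB."""
    # Zero-mean both signals so the metric ignores DC offsets.
    estimate = estimate - estimate.mean()
    source = source - source.mean()
    # Project the estimate onto the source: the "target" component.
    s_target = (np.dot(estimate, source) / (np.dot(source, source) + eps)) * source
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

def si_snri(estimate, source, mixture):
    """SI-SNR improvement: the estimate's SI-SNR minus the raw mixture's SI-SNR."""
    return si_snr(estimate, source) - si_snr(mixture, source)
```
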
Researcher Affiliation: Academia
Joel Rixen, Matthias Renz; Department of Computer Science, Kiel University, Germany

Pseudocode: No
The paper describes methods and architectures but does not include structured pseudocode or algorithm blocks.

Open Source Code: No
The paper provides a link (https://github.com/j-rixen/SFSRNet-Audio-samples) for audio samples, but does not explicitly state that the source code for the methodology is openly available or provide a link to it.

Open Datasets: Yes
We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset (Hershey et al. 2016), which is based on the WSJ0 corpus (Garofolo, John S. et al. 1993). Aside from the WSJ0-2mix, the clean and noisy Libri2Mix datasets (Cosentino et al. 2020) are also used to evaluate the SFSRNet. The Libri2Mix datasets are based on the LibriSpeech ASR corpus (Panayotov et al. 2015).

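As background on how such corpora are built: WSJ0-2mix sums pairs of WSJ0 utterances at random relative levels. The following is a hedged sketch of that mixing step; the uniform 0 to 5 dB SNR range follows Hershey et al. (2016), and truncating to the shorter utterance corresponds to the dataset's "min" variant.

```python
import numpy as np

def mix_two_speakers(s1, s2, snr_db, eps=1e-8):
    """Mix two utterances so that s1 sits snr_db above s2 in power."""
    n = min(len(s1), len(s2))    # truncate to the shorter utterance ("min" variant)
    s1, s2 = s1[:n].astype(float), s2[:n].astype(float)
    # Rescale s1 so that 10*log10(power(s1)/power(s2)) equals snr_db.
    scale = np.sqrt(np.mean(s2**2) / (np.mean(s1**2) + eps) * 10 ** (snr_db / 10))
    s1 = scale * s1
    return s1 + s2, s1, s2       # mixture plus the rescaled reference sources

# Per mixture, the relative level is drawn at random, e.g.:
# snr_db = np.random.uniform(0.0, 5.0)
```
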
Dataset Splits: Yes
This dataset contains 30 hours of training, 10 hours of validation data and 5 hours of evaluation data. ... The Libri2Mix datasets consist of 212 hours of training, 11 hours of validation and 11 hours of evaluation data.

Hardware Specification: Yes
The speed is how long it takes for an RTX 2070 Super to separate 1 second given a 4-second mixture.

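The speed figure has a precise protocol behind it: wall-clock time per second of audio, given a 4-second input mixture. A minimal PyTorch timing sketch of that protocol follows; the warm-up pass, run count, and 8 kHz sample rate are our assumptions, since the paper only fixes the GPU and the 4-second mixture length.

```python
import time
import torch

@torch.no_grad()
def seconds_per_audio_second(model, sample_rate=8000, mix_seconds=4, n_runs=10):
    """Wall-clock separation time, normalized per second of input audio.

    `model` is assumed to map a (batch, samples) mixture on the GPU to
    (batch, speakers, samples); 8 kHz matches the WSJ0-2mix convention.
    """
    model = model.cuda().eval()
    mixture = torch.randn(1, sample_rate * mix_seconds, device="cuda")
    model(mixture)                      # warm-up run (kernel selection, caching)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(mixture)
    torch.cuda.synchronize()            # wait for queued GPU work before stopping the clock
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / mix_seconds
```
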
Software Dependencies: No
The paper mentions the Adam optimizer and PReLU activation, but does not provide specific version numbers for software dependencies like PyTorch, CUDA, or other libraries used for implementation.

Experiment Setup: Yes
The encoder has a channel size of 256, a kernel size of 16 and a stride factor of 8. The chunk size is 50 with an overlap of 50%. We use K = 8 SepFormer blocks with P = 2 Intra- and Inter-Transformers each. The kernel size of the two-dimensional convolutional layers of the Transformers is 3. The linear layer preceding the two-dimensional convolutional layers has 512 hidden units. The MHAs use 8 parallel heads. For the SR, we use a frame length of 256 and a frame step of 64 for the STFTs. The convolutional layers have channel sizes of 128, 256, 128 and C, respectively. Their kernel sizes are 5, 9, 11 and 11, respectively. The group size of the group normalization is set to 1. ... For the optimization, the Adam optimizer (Kingma and Ba 2017) is utilized with a learning rate of 15e-5. After the first 100 epochs, the learning rate is halved once the performance on the validation dataset does not improve for 3 epochs. Gradient clipping is used with a maximum L2-norm of 5. The network is trained for 200 epochs in total.

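To make the optimization recipe concrete, here is a runnable PyTorch sketch wiring the quoted hyperparameters together. The Conv1d stand-in, the dummy 8 kHz batch, and the placeholder loss and validation metric are ours, and mapping the halving rule onto ReduceLROnPlateau is our reading of the description, not the authors' code.

```python
import torch

LR = 15e-5              # Adam learning rate from the paper (i.e. 1.5e-4)
MAX_GRAD_NORM = 5.0     # gradient clipping, maximum L2-norm
TOTAL_EPOCHS = 200
DECAY_START = 100       # halving begins after the first 100 epochs...
PATIENCE = 3            # ...once validation stalls for 3 epochs

# Stand-in for the full SFSRNet: just the quoted encoder convolution
# (channel size 256, kernel size 16, stride factor 8).
model = torch.nn.Conv1d(1, 256, kernel_size=16, stride=8)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=PATIENCE)

for epoch in range(TOTAL_EPOCHS):
    mixture = torch.randn(4, 1, 32000)       # dummy batch: 4 s of 8 kHz audio
    loss = model(mixture).pow(2).mean()      # placeholder loss, not SI-SNR
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    if epoch >= DECAY_START:
        scheduler.step(-loss.item())         # placeholder validation SI-SNRi
```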