PGSS: Pitch-Guided Speech Separation

Authors: Xiang Li, Yiwen Wang, Yifan Sun, Xihong Wu, Jing Chen

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the WSJ0-2mix corpus reveal that the proposed approaches can achieve higher pitch extraction accuracy and better separation performance, compared to the baseline models, and have the potential to be applied to SOTA architectures.
Researcher Affiliation | Academia | School of Intelligence Science and Technology, Peking University, Beijing, China; chenj@cis.pku.edu.cn
Pseudocode | No | The paper includes architectural diagrams (Figure 1, Figure 2, Figure 3) but no explicit pseudocode or algorithm blocks with numbered steps or code-like formatting.
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the described method, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | The proposed framework is evaluated on the Wall Street Journal (WSJ0) corpus. The WSJ0-2mix and -3mix datasets are the benchmarks designed for speech separation, introduced by Hershey et al. (2016).
Dataset Splits | Yes | For WSJ0-2mix, the 30h training set and the 10h validation set contain two-speaker mixtures generated by randomly selecting speakers and utterances from the WSJ0 training set si_tr_s, and mixing them at various Signal-to-Noise Ratios (SNRs) uniformly chosen between 0 dB and 5 dB. The 5h test set was similarly generated using utterances from 18 speakers from the WSJ0 validation set si_dt_05 and evaluation set si_et_05. (A simplified mixing sketch follows the table.)
Hardware Specification | No | The paper mentions support from 'the High-performance Computing Platform of Peking University' but does not specify any exact GPU/CPU models, processor types, or memory details used for running experiments.
Software Dependencies | No | The paper mentions using 'Praat (Boersma 2001)' for reference pitch extraction but does not provide specific version numbers for any key software components, libraries, or frameworks used in the implementation of their models. (An illustrative Praat-based pitch-extraction sketch follows the table.)
Experiment Setup | Yes | The input magnitudes are computed from STFT with 25 ms window length, 10 ms hop size, and the Hann window. We quantize the frequency range from 60 to 404 Hz into 67 bins using 24 bins per octave in a logarithmic scale. (A front-end sketch based on these numbers follows the table.)
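
The quoted WSJ0-2mix recipe amounts to scaling one utterance relative to another so that their power ratio hits an SNR drawn uniformly from [0, 5] dB. A minimal sketch of that step is below; it ignores the length-matching and level-normalization details of the official MATLAB recipe, and the 8 kHz placeholder signals are only stand-ins, not real WSJ0 utterances.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Mix two equal-length utterances so that s1 is snr_db louder than s2."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    # Scale the second speaker so the resulting power ratio equals the target SNR.
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    return s1 + gain * s2

rng = np.random.default_rng(0)
snr_db = rng.uniform(0.0, 5.0)        # SNR drawn uniformly from [0, 5] dB
s1 = rng.standard_normal(8000)        # placeholder "utterances" (1 s at 8 kHz)
s2 = rng.standard_normal(8000)
mixture = mix_at_snr(s1, s2, snr_db)
```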
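The experiment-setup row gives enough detail to sketch the front end: Hann-windowed STFT magnitudes with a 25 ms window and 10 ms hop, plus 67 pitch bins spaced at 24 bins per octave starting from 60 Hz (60 * 2^(66/24) ≈ 404 Hz, which matches the stated upper bound). The paper excerpt does not quote a sample rate, so the 8 kHz value below is an assumption; the rest follows the numbers in the row.

```python
import numpy as np
from scipy.signal import stft

SR = 8000                       # assumed sample rate; not stated in the quoted setup
WIN = int(0.025 * SR)           # 25 ms window -> 200 samples at 8 kHz
HOP = int(0.010 * SR)           # 10 ms hop    -> 80 samples at 8 kHz

def magnitude_spectrogram(x):
    """Hann-windowed STFT magnitudes, as described in the experiment setup."""
    _, _, spec = stft(x, fs=SR, window='hann', nperseg=WIN,
                      noverlap=WIN - HOP, boundary=None)
    return np.abs(spec)

# 67 pitch bins from 60 Hz upward at 24 bins per octave (logarithmic spacing);
# the last bin lands at 60 * 2**(66/24) ~= 404 Hz.
pitch_bins = 60.0 * 2.0 ** (np.arange(67) / 24.0)

def quantize_pitch(f0_hz):
    """Map a voiced frame's F0 (Hz) to the nearest of the 67 log-spaced bins."""
    if f0_hz <= 0:              # unvoiced frames carry no pitch target
        return None
    return int(np.argmin(np.abs(np.log2(pitch_bins) - np.log2(f0_hz))))
```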
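For the reference pitch extracted with Praat, one plausible way to reproduce the step in Python is through parselmouth, a wrapper around Praat's pitch tracker. The paper only states that Praat was used, so the wrapper itself and the settings below (10 ms step, 60 to 404 Hz search range, chosen to match the STFT hop and the pitch-bin range above) are assumptions for illustration, not the authors' configuration.

```python
import parselmouth

def reference_pitch(wav_path):
    """Frame-level F0 track from Praat's pitch tracker (0.0 marks unvoiced frames)."""
    snd = parselmouth.Sound(wav_path)
    # Hypothetical settings: 10 ms time step, 60-404 Hz search range.
    pitch = snd.to_pitch(time_step=0.010, pitch_floor=60.0, pitch_ceiling=404.0)
    return pitch.selected_array['frequency']
```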