Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

Authors: Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments on several different tasks including speech separation and multi-speaker speech recognition show that our conditional multi-sequence models lead to consistent improvements over the conventional non-conditional models." |
| Researcher Affiliation | Collaboration | 1. Center for Language and Speech Processing, Johns Hopkins University, U.S.A.; 2. Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; 3. ASLP@NPU, School of Computer Science, Northwestern Polytechnical University, Xi'an, China; 4. Hitachi, Ltd. Research & Development Group, Japan |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our source code and Supplementary Material could be available on our webpage: https://demotoshow.github.io/." |
| Open Datasets | Yes | "For the speech mixtures, i.e., the input O for our tasks, with different numbers of speakers, data from the Wall Street Journal (WSJ) corpus is used. In the two-speaker scenario, we use the common benchmark called WSJ0-2mix dataset introduced in [15]." |
| Dataset Splits | Yes | "The 30 h training set and the 10 h validation set contain two-speaker mixtures generated by randomly selecting speakers and utterances from the WSJ0 training set si_tr_s, and mixing them at various signal-to-noise ratios (SNRs) uniformly chosen between 0 dB and 10 dB. The 5 h test set was similarly generated using utterances from 18 speakers from the WSJ0 validation set si_dt_05 and evaluation set si_et_05." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like CTC and refers to ESPnet, but does not provide specific version numbers for software dependencies (e.g., library or solver names with versions). |
| Experiment Setup | No | "In Section A of the Supplementary Material, we provide the implementation details about all our experiments, and we also extend our model to one iterative speech denoising task in Section D." |
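The dataset-splits description quoted above mixes pairs of WSJ0 utterances at SNRs drawn uniformly between 0 dB and 10 dB. A minimal sketch of that mixing step, assuming NumPy and mono float waveforms (the function name `mix_at_snr` is illustrative, not taken from the paper's code):

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Mix two mono waveforms so that s1 is snr_db louder than the
    rescaled s2; the shorter signal is zero-padded to the longer one."""
    n = max(len(s1), len(s2))
    s1 = np.pad(s1, (0, n - len(s1)))
    s2 = np.pad(s2, (0, n - len(s2)))
    # gain g on s2 such that 10*log10(power(s1) / power(g*s2)) == snr_db
    g = np.sqrt(np.mean(s1**2) / (np.mean(s2**2) * 10 ** (snr_db / 10)))
    return s1 + g * s2

# SNR drawn uniformly from [0, 10] dB, as in the split description above
rng = np.random.default_rng(0)
mixture = mix_at_snr(rng.standard_normal(16000),
                     rng.standard_normal(12000),
                     rng.uniform(0.0, 10.0))
```

Note that the official WSJ0-2mix scripts also normalize speech levels (via `activlev`) before applying the relative gain; this sketch skips that step for brevity.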