Cross-Talk Reduction
Authors: Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet. |
| Researcher Affiliation | Collaboration | Zhong-Qiu Wang (Southern University of Science and Technology, China); Anurag Kumar (Meta Reality Labs Research, USA); Shinji Watanabe (Carnegie Mellon University, USA) |
| Pseudocode | Yes | At run time, to separate the close-talk speech of an entire session, we run CTRnet in a block-wise way, using the pseudo-code below at each processing block. (A minimal block-wise inference sketch is given after this table.) |
| Open Source Code | No | A sound demo is provided in the link below. (Footnote 2: see https://zqwang7.github.io/demos/CTRnet_demo/index.html.) This link leads to a sound demo, not the source code for the methodology presented in the paper. |
| Open Datasets | Yes | SMS-WSJ-FF-CT, with FF meaning far-field and CT close-talk, is built upon a simulated dataset named SMS-WSJ [Drude et al., 2019]... train and evaluate CTRnet using the real-recorded CHiME-7 dataset, following the setup of the CHiME-7 DASR challenge [Cornell et al., 2023]. |
| Dataset Splits | Yes | SMS-WSJ [Drude et al., 2019]... has 33,561 (≈ 87.4 h), 982 (≈ 2.5 h) and 1,332 (≈ 3.4 h) 2-speaker mixtures for training, validation and testing. ... CHiME-7 dataset... There are 14 (≈ 34 h), 2 (≈ 2 h) and 4 (≈ 5 h) recorded sessions respectively for training, validation and testing. |
| Hardware Specification | No | Experiments of this work used the Bridges2 system at PSC and Delta at NCSA through allocations CIS210014 and IRI120008P from the Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support (ACCESS) program. While specific systems are named, the paper does not specify the exact GPU models, CPU models, or memory configurations of these resources, which would be needed for a full hardware specification. |
| Software Dependencies | No | The paper mentions several software components like "TF-GridNet [Wang et al., 2023c]", the "torchiva toolkit [Scheibler and Saijo, 2022]", and "WavLM features", but it does not provide specific version numbers for these or other crucial software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For training, by default we sample an L-second segment from each mixture in each epoch, and the batch size is H. ... For STFT, the window size is 16 ms, the hop size 8 ms, and the square root of the Hann window is used as the analysis window. TF-GridNet [Wang et al., 2023c] is employed as the DNN architecture. Using the symbols defined in Table I of [Wang et al., 2023c], we set its hyper-parameters to D = 128, B = 4, I = 1, J = 1, H = 192, L = 4 and E = 4 (please do not confuse these symbols with the ones defined in this paper). The model has around 4.8 million parameters. ξ in (10) and (11) is tuned to 10^-3. β in (15) is set to 1.0. ... The filter taps I and J are tuned to 19 and 1. ... The processing block size is set to 8 seconds, the same as the segment length used during training. We configure the blocks to be slightly overlapped, where we consider the first and the last 0.96 seconds as context, and output the DNN estimates in the center 6.08 (= 8 − 0.96 − 0.96) seconds of each block. (Sketches of the STFT configuration and the TF-GridNet hyper-parameters follow this table.) |
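The block-wise run-time procedure quoted in the Pseudocode and Experiment Setup rows can be made concrete with a short sketch. This is a minimal illustration rather than the paper's actual pseudo-code: `model` is a hypothetical stand-in for CTRnet's forward pass, the 16 kHz sampling rate is an assumption, and the model output is assumed to have the same length as its input block. The sketch cuts a session into 8-second blocks, treats 0.96 s on each side as context, and keeps only the center 6.08 s of each block's output.

```python
import torch

def separate_session(mixture: torch.Tensor, model, sr: int = 16000) -> torch.Tensor:
    """Block-wise inference sketch: 8 s blocks, 0.96 s context per side.

    mixture: (num_samples,) session waveform.
    model:   hypothetical callable mapping an 8 s block to an
             equal-length estimate (assumption, not the paper's API).
    """
    block = int(8.0 * sr)            # 8 s processing block
    ctx = int(0.96 * sr)             # context trimmed from each side
    step = block - 2 * ctx           # 6.08 s of fresh output per block
    n = mixture.shape[-1]
    out = torch.zeros_like(mixture)
    pos = -ctx                       # first block starts 0.96 s before t = 0
    while pos + ctx < n:
        seg = mixture[max(pos, 0): pos + block]
        # Zero-pad at the session edges so every block is exactly 8 s long.
        pad_l = max(-pos, 0)
        pad_r = block - pad_l - seg.shape[-1]
        seg = torch.nn.functional.pad(seg, (pad_l, pad_r))
        est = model(seg)
        # Keep only the center portion, discarding the context on both sides.
        lo, hi = pos + ctx, min(pos + ctx + step, n)
        out[lo:hi] = est[ctx: ctx + (hi - lo)]
        pos += step
    return out

# Identity "model" just to show the call shape; a 30 s session round-trips.
session = torch.randn(16000 * 30)
assert torch.equal(separate_session(session, lambda x: x), session)
```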
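Likewise, the quoted STFT settings translate directly into a `torch.stft` call. A minimal sketch, assuming 16 kHz audio for illustration; at other sampling rates the window and hop sizes in samples scale accordingly.

```python
import torch

SR = 16000                               # assumed sampling rate
WIN = int(0.016 * SR)                    # 16 ms window = 256 samples
HOP = int(0.008 * SR)                    # 8 ms hop = 128 samples
WINDOW = torch.hann_window(WIN).sqrt()   # square root of the Hann window

def stft(wave: torch.Tensor) -> torch.Tensor:
    """(..., num_samples) -> complex spectrogram (..., freq, frames)."""
    return torch.stft(wave, n_fft=WIN, hop_length=HOP, win_length=WIN,
                      window=WINDOW, return_complex=True)

spec = stft(torch.randn(SR * 8))         # one 8 s training-length segment
print(spec.shape)                        # torch.Size([129, 1001])
```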
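For reference, the quoted TF-GridNet hyper-parameters can be gathered into a single configuration mapping. The symbol names follow Table I of [Wang et al., 2023c] as quoted above; how a TF-GridNet implementation consumes them is not shown here.

```python
# TF-GridNet hyper-parameters as quoted in the Experiment Setup row
# (symbols from Table I of [Wang et al., 2023c]); per the paper, this
# configuration has around 4.8 million parameters.
TFGRIDNET_CONFIG = dict(D=128, B=4, I=1, J=1, H=192, L=4, E=4)
```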