Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Authors: Zhaoxi Mu, Xinyu Yang, Sining Sun, Qing Yang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion." (Experiments, Datasets and Implementation Details:) "The TSE model is trained and evaluated using the widely-used two-speaker mixed dataset WSJ0-2mix (Hershey et al. 2016) and its derivative dataset WSJ0-2mix-extr (Xu et al. 2020)."
Researcher Affiliation | Collaboration | Zhaoxi Mu1, Xinyu Yang1, Sining Sun2, Qing Yang2; 1Xi'an Jiaotong University, 2Du Xiaoman; wsmzxxh@stu.xjtu.edu.cn, yxyphd@mail.xjtu.edu.cn, {sunsining,yangqing}@duxiaoman.com
Pseudocode | Yes |
Algorithm 1: SDR-TSE Optimization
Require: The training data D containing mixed-target-reference speech triplets (y, u, x).
1: Initialize the entire system randomly.
2: while not converged do
3:   Sample {(y_i, u_i, x_i)}_{i=1}^N from D.
4:   // Forward propagation
5:   Reconstruct the spectrograms {X̂_i}_{i=1}^N of {x_i}_{i=1}^N and predict the target speech {û_i}_{i=1}^N.
6:   // Back propagation
7:   Update θ_V by maximizing L_LL.
8:   Update θ_Eg, θ_Ec, θ_D, θ_G, and θ_F by minimizing L_KL, L_REC, I^v_CLUB, and L_SI-SNR.
9:   Update θ_G and θ_Eg by minimizing L_SIM.
10: end while
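Read as a training loop, Algorithm 1 alternates a gradient-ascent step (maximizing L_LL for the variational network V) with gradient-descent steps for the remaining parameter groups. The sketch below is a toy illustration of that maximize-then-minimize schedule only: the scalar "parameter groups" and quadratic stand-in losses are assumptions for demonstration, not the paper's actual objectives or model.

```python
# Toy sketch of Algorithm 1's alternating update schedule.
# theta_V / theta_rest / theta_G stand in for the parameter groups;
# the quadratic losses below are illustrative placeholders, NOT the
# paper's L_LL, L_KL, L_REC, I^v_CLUB, L_SI-SNR, or L_SIM.

def sgd_step(theta, grad, lr):
    """One gradient-descent step on a scalar parameter."""
    return theta - lr * grad

def train(steps=500, lr=0.1):
    theta_V, theta_rest, theta_G = 5.0, 5.0, 5.0  # placeholder init
    for _ in range(steps):
        # Step 7: update theta_V by MAXIMIZING its objective -> ascent.
        # Toy objective: -(theta_V - 1)^2, maximized at theta_V = 1.
        grad_LL = -2.0 * (theta_V - 1.0)
        theta_V = theta_V + lr * grad_LL  # ascent: move WITH the gradient
        # Step 8: update the remaining groups by MINIMIZING the summed
        # losses. Toy combined loss: (theta - 2)^2, minimized at 2.
        grad_combined = 2.0 * (theta_rest - 2.0)
        theta_rest = sgd_step(theta_rest, grad_combined, lr)
        # Step 9: extra minimization step for theta_G (and theta_Eg).
        # Toy similarity loss: (theta - 3)^2, minimized at 3.
        grad_SIM = 2.0 * (theta_G - 3.0)
        theta_G = sgd_step(theta_G, grad_SIM, lr)
    return theta_V, theta_rest, theta_G
```

Each parameter group converges to the optimum of its own stand-in objective, mirroring how the real system trains V adversarially against the other modules within one loop iteration.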
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | The TSE model is trained and evaluated using the widely-used two-speaker mixed dataset WSJ0-2mix (Hershey et al. 2016) and its derivative dataset WSJ0-2mix-extr (Xu et al. 2020).
Dataset Splits | No | The paper states that the model is "trained and evaluated" on WSJ0-2mix and WSJ0-2mix-extr but does not explicitly describe training, validation, and test splits with percentages, counts, or explicit labels such as "validation set".
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or cloud instance types used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or versions (e.g., programming-language versions, library versions, or solver versions) used in the experiments.
Experiment Setup | Yes | The dimensions d_g, d_c, d_s, and H are all set to 256. The variational approximation network V is implemented using two four-layer fully connected networks that predict the mean and variance of the posterior distribution, respectively. The model comprises 45M parameters in total. The weights of L_SI-SNR, L_REC, L_KL, I^v_CLUB, L_LL, and L_SIM are set to 1, 10^-3, 10^-4, 10^-4, 10^-3, and 10^-3, respectively, determined through a grid search. L, O, and η are set to 250 ms, 125 ms, and 5% of the maximum energy, respectively.
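For reference, the reported hyperparameters can be gathered into one configuration sketch. The key names below are hypothetical (chosen for readability); only the numeric values come from the excerpt above, and the interpretation of L, O, and η as windowing/threshold settings is an assumption.

```python
# Hypothetical config collecting the reported setup; key names are assumptions,
# values are taken from the paper excerpt.
config = {
    "dims": {"d_g": 256, "d_c": 256, "d_s": 256, "H": 256},  # all set to 256
    "num_params": 45_000_000,       # total model size (45M)
    "loss_weights": {               # determined via grid search
        "SI_SNR": 1.0,
        "REC": 1e-3,
        "KL": 1e-4,
        "CLUB": 1e-4,
        "LL": 1e-3,
        "SIM": 1e-3,
    },
    "segmentation": {               # assumed meaning of L, O, eta
        "L_ms": 250,                # window length L = 250 ms
        "O_ms": 125,                # overlap O = 125 ms
        "eta": 0.05,                # threshold: 5% of the maximum energy
    },
}
```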