Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction
Authors: Zhaoxi Mu, Xinyu Yang, Sining Sun, Qing Yang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results substantiate the effectiveness of our meticulously crafted approach, showcasing a substantial reduction in the likelihood of speaker confusion. [...] The TSE model is trained and evaluated using the widely-used two-speaker mixed dataset WSJ0-2mix (Hershey et al. 2016) and its derivative dataset WSJ0-2mix-extr (Xu et al. 2020). |
| Researcher Affiliation | Collaboration | Zhaoxi Mu¹, Xinyu Yang¹, Sining Sun², Qing Yang²; ¹Xi'an Jiaotong University, ²Du Xiaoman; wsmzxxh@stu.xjtu.edu.cn, yxyphd@mail.xjtu.edu.cn, {sunsining,yangqing}@duxiaoman.com |
| Pseudocode | Yes | Algorithm 1: SDR-TSE Optimization. Require: the training data $\mathcal{D}$ containing mixed-target-reference speech triplets $(y, u, x)$. 1: Initialize the entire system randomly. 2: while not converged do 3: Sample $\{(y_i, u_i, x_i)\}_{i=1}^{N}$ from $\mathcal{D}$. 4: Forward-Propagation: 5: Reconstruct the spectrograms $\{\hat{X}_i\}_{i=1}^{N}$ of $\{x_i\}_{i=1}^{N}$ and predict the target speech $\{\hat{u}_i\}_{i=1}^{N}$. 6: Back-Propagation: 7: Update $\theta_V$ by maximizing $\mathcal{L}_{LL}$. 8: Update $\theta_{E_g}$, $\theta_{E_c}$, $\theta_D$, $\theta_G$ and $\theta_F$ by minimizing $\mathcal{L}_{KL}$, $\mathcal{L}_{REC}$, $I_{vCLUB}$ and $\mathcal{L}_{SI\text{-}SNR}$. 9: Update $\theta_G$ and $\theta_{E_g}$ by minimizing $\mathcal{L}_{SIM}$. 10: end while (A runnable sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | The TSE model is trained and evaluated using the widely-used two-speaker mixed dataset WSJ0-2mix (Hershey et al. 2016) and its derivative dataset WSJ0-2mix-extr (Xu et al. 2020). |
| Dataset Splits | No | The paper states that the model is "trained and evaluated" on WSJ0-2mix and WSJ0-2mix-extr datasets but does not explicitly describe training, validation, and test dataset splits with percentages, counts, or explicit labels like "validation set". |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or versions (e.g., programming language versions, library versions, or specific solver versions) used in the experiments. |
| Experiment Setup | Yes | The dimensions $d_g$, $d_c$, $d_s$, and $H$ are all set to 256. The variational approximation network $V$ is implemented using two four-layer fully connected networks to predict the mean and variance of the posterior distribution, respectively. The model encompasses a total of 45M parameters. The weights of $\mathcal{L}_{SI\text{-}SNR}$, $\mathcal{L}_{REC}$, $\mathcal{L}_{KL}$, $I_{vCLUB}$, $\mathcal{L}_{LL}$ and $\mathcal{L}_{SIM}$ are set to 1, $10^{-3}$, $10^{-4}$, $10^{-4}$, $10^{-3}$ and $10^{-3}$, respectively, determined through a grid search. $L$, $O$, and $\eta$ are set to 250 ms, 125 ms, and 5% of the maximum energy, respectively (see the segmentation sketch after the table). |
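
For concreteness, here is a minimal, runnable PyTorch sketch of the Algorithm 1 optimization loop quoted in the table. Since the paper releases no code, every module (`E_g`, `E_c`, `Dec`, `G`, `F`, `V`) and every loss term below is a tiny hypothetical placeholder; only the loop structure, the module names (from the $\theta$ subscripts), the 256-dimensional sizes, and the grid-searched loss weights come from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 256  # the paper sets d_g, d_c, d_s and H to 256

# Tiny placeholder modules standing in for the paper's components;
# the real networks total 45M parameters.
E_g = nn.Linear(dim, dim)            # global (speaker) encoder, theta_Eg
E_c = nn.Linear(dim, dim)            # content encoder, theta_Ec
Dec = nn.Linear(2 * dim, dim)        # spectrogram decoder, theta_D
G = nn.Linear(dim, dim)              # reference-speech branch, theta_G
F = nn.Linear(2 * dim, dim)          # target-speech predictor, theta_F
V = nn.Sequential(                   # variational approximation network V;
    nn.Linear(dim, dim), nn.ReLU(),  # the paper uses two four-layer MLPs
    nn.Linear(dim, 2 * dim),         # for posterior mean and variance
)

# V is updated adversarially (maximizing L_LL), so it gets its own optimizer.
opt_main = torch.optim.Adam(
    [*E_g.parameters(), *E_c.parameters(), *Dec.parameters(),
     *G.parameters(), *F.parameters()], lr=1e-3)
opt_v = torch.optim.Adam(V.parameters(), lr=1e-3)

# Grid-searched loss weights reported in the paper.
w = dict(si_snr=1.0, rec=1e-3, kl=1e-4, club=1e-4, ll=1e-3, sim=1e-3)

def forward_losses(y, u, x):
    """Placeholder loss terms; the real L_KL, L_REC, I_vCLUB, L_SI-SNR,
    L_LL and L_SIM are defined in the paper, not reproduced here."""
    g, c = E_g(x), E_c(y)
    x_hat = Dec(torch.cat([g, c], dim=-1))   # line 5: reconstruct X_hat
    u_hat = F(torch.cat([G(x), y], dim=-1))  # line 5: predict u_hat
    stats = V(c)                             # posterior stats for vCLUB
    return dict(
        rec=(x_hat - x).pow(2).mean(),       # stands in for L_REC
        si_snr=(u_hat - u).pow(2).mean(),    # stands in for -SI-SNR
        kl=c.pow(2).mean(),                  # stands in for L_KL
        club=stats.mean().abs(),             # stands in for I_vCLUB
        ll=-stats.pow(2).mean(),             # stands in for L_LL
        sim=(g - G(x)).pow(2).mean(),        # stands in for L_SIM
    )

for step in range(3):                        # line 2: "while not converged"
    y, u, x = (torch.randn(8, dim) for _ in range(3))  # line 3: dummy triplets

    # Line 7: update theta_V alone by maximizing L_LL.
    l = forward_losses(y, u, x)
    opt_v.zero_grad()
    (-w["ll"] * l["ll"]).backward()
    opt_v.step()

    # Lines 8-9: update the remaining parameters on the weighted sum.
    # (The paper minimizes L_SIM over theta_G and theta_Eg in a separate
    # step; it is folded in here, since its gradient only reaches G and E_g.)
    l = forward_losses(y, u, x)
    total = (w["si_snr"] * l["si_snr"] + w["rec"] * l["rec"]
             + w["kl"] * l["kl"] + w["club"] * l["club"]
             + w["sim"] * l["sim"])
    opt_main.zero_grad()
    total.backward()
    opt_main.step()
```

Folding the $\mathcal{L}_{SIM}$ step into one weighted sum is a simplification for brevity; a faithful reproduction would run it as the separate update that Algorithm 1 lines 8 and 9 describe.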
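
The Experiment Setup row quotes segmentation parameters $L$, $O$, and $\eta$ without saying how they are applied; a plausible reading is energy-thresholded segmentation of a speech signal. The sketch below shows that reading under an assumed 8 kHz sampling rate (the rate WSJ0-2mix is commonly released at, not a value stated in the table); the function name and thresholding logic are illustrative only.

```python
import numpy as np

sr = 8000                 # assumed sampling rate (Hz); not stated in the table
L = int(0.250 * sr)       # segment length L = 250 ms -> 2000 samples
O = int(0.125 * sr)       # segment overlap O = 125 ms -> 1000 samples
hop = L - O               # 50% overlap -> 1000-sample hop

def active_segments(wav: np.ndarray) -> np.ndarray:
    """Boolean mask of segments whose energy exceeds
    eta = 5% of the maximum segment energy."""
    starts = range(0, max(len(wav) - L, 0) + 1, hop)
    energy = np.array([np.sum(wav[s:s + L] ** 2) for s in starts])
    return energy > 0.05 * energy.max()

# Example: 1 s of audio yields 7 overlapping segments.
print(active_segments(np.random.randn(sr)))
```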