Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
Authors: Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Oxford |
| Pseudocode | Yes | The pseudo-code for the Encoder Fusion Module is shown in Figure 3. |
| Open Source Code | Yes | Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll |
| Open Datasets | Yes | Datasets: We construct our Positive, Negative, and Mixed Audios from the samples in the LibriSpeech dataset [30], with background noise $n_{\{M,P,N\}}$ from the WHAM! dataset [31]. |
| Dataset Splits | Yes | We generate our training, validation, and testing data from the train-clean-360, dev-clean, and test-clean components from the LibriSpeech dataset [30], respectively. |
| Hardware Specification | Yes | All the training is done on a single Nvidia A10 24GB GPU with a batch size of 2. |
| Software Dependencies | No | The paper mentions using 'built-in Pytorch function', 'Adam optimizer', and 'WebRTC Voice Activity Detector [38]' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | All the training is done on a single Nvidia A10 24GB GPU with a batch size of 2. In both training stages, we use the Adam optimizer and decay the learning rate by half when the validation loss does not decrease for more than 50 epochs. In particular, the initial learning rate is set to 5e-4 for the whole Siamese Encoder, 1e-3 for the Encoder Fusion Module in the first training stage, and 2e-3 for the whole extraction branch in the second training stage. 500 epochs (200k optimization steps) are used in the first pretraining stage, and 1000 epochs (400k optimization steps) are used in the second stage. |
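
The learning-rate rule quoted in the Experiment Setup row (halve the rate when the validation loss has not decreased for more than 50 epochs) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the class name `HalveOnPlateau`, the dictionary keys, the strict-improvement test, and the counter reset after each decay are all assumptions. The steps-per-epoch values are simple arithmetic on the quoted epoch and step counts.

```python
from dataclasses import dataclass


@dataclass
class HalveOnPlateau:
    """Hypothetical helper: halve `lr` once validation loss has failed to
    improve for more than `patience` consecutive epochs."""

    lr: float
    patience: int = 50          # "more than 50 epochs" in the quoted setup
    factor: float = 0.5         # "decay the learning rate by half"
    best: float = float("inf")
    stale_epochs: int = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the current lr."""
        if val_loss < self.best:
            # Strict improvement resets the stall counter (an assumption).
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0  # assumed: start counting afresh
        return self.lr


# Initial learning rates quoted in the table; the key names are illustrative.
INITIAL_LR = {
    "siamese_encoder": 5e-4,
    "encoder_fusion_stage1": 1e-3,
    "extraction_branch_stage2": 2e-3,
}

# Steps per epoch implied by the quoted epoch and optimization-step counts.
STEPS_PER_EPOCH_STAGE1 = 200_000 // 500
STEPS_PER_EPOCH_STAGE2 = 400_000 // 1000
```

Under these assumptions both stages work out to 400 optimization steps per epoch, which at the quoted batch size of 2 would correspond to 800 training samples per epoch.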