Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

Authors: Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%.
Researcher Affiliation	Academia	Department of Computer Science, University of Oxford EMAIL
Pseudocode	Yes	The pseudo-code for the Encoder Fusion Module is shown in Figure 3.
Open Source Code	Yes	Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll .
Open Datasets	Yes	Datasets: We construct our Positive, Negative, and Mixed Audios from the samples in the LibriSpeech dataset [30], with background noise n_{M,P,N} from the WHAM! dataset [31].
Dataset Splits	Yes	We generate our training, validation, and testing data from the train-clean-360, dev-clean, and test-clean components from the LibriSpeech dataset [30], respectively.
Hardware Specification	Yes	All the training is done on a single Nvidia A10 24GB GPU with a batch size of 2.
Software Dependencies	No	The paper mentions using 'built-in PyTorch function', 'Adam optimizer', and 'WebRTC Voice Activity Detector [38]' but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	All the training is done on a single Nvidia A10 24GB GPU with a batch size of 2. In both training stages, we use the Adam optimizer and decay the learning rate by half when the validation loss does not decrease for more than 50 epochs. In particular, the initial learning rate is set to 5e-4 for the whole Siamese Encoder, 1e-3 for the Encoder Fusion Module in the first training stage, and 2e-3 for the whole extraction branch in the second training stage. 500 epochs (200k optimization steps) are used in the first pretraining stage, and 1000 epochs (400k optimization steps) are used in the second stage.
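
The learning-rate policy quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the class `HalveOnPlateau`, the helper `steps_per_epoch`, and the dictionary keys are hypothetical names introduced here, and the source does not say which scheduler implementation was used. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.5` and `patience=50` would behave similarly, but using it is an assumption.

```python
# Hypothetical sketch of the reported schedule; names are illustrative only.

# Initial learning rates as quoted in the Experiment Setup row.
INITIAL_LR = {
    "siamese_encoder": 5e-4,           # whole Siamese Encoder
    "encoder_fusion_stage1": 1e-3,     # Encoder Fusion Module, first stage
    "extraction_branch_stage2": 2e-3,  # whole extraction branch, second stage
}


class HalveOnPlateau:
    """Halve the learning rate once the validation loss has failed to
    decrease for more than `patience` consecutive epochs."""

    def __init__(self, lr, patience=50, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0  # restart the count after a decay
        return self.lr


def steps_per_epoch(total_steps, epochs):
    """Optimization steps per epoch implied by the reported totals."""
    return total_steps // epochs


if __name__ == "__main__":
    sched = HalveOnPlateau(INITIAL_LR["encoder_fusion_stage1"])
    sched.step(1.0)          # first epoch sets the best loss
    for _ in range(51):      # 51 epochs without improvement
        lr = sched.step(1.0)
    print(lr)
    print(steps_per_epoch(200_000, 500), steps_per_epoch(400_000, 1000))
```

Both stages work out to 400 optimization steps per epoch (200k / 500 and 400k / 1000). With a batch size of 2, and assuming one step per batch with no gradient accumulation, that corresponds to roughly 800 training samples per epoch.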