Cross-Modal Distillation for Speaker Recognition
Authors: Yufeng Jin, Guosheng Hu, Haonan Chen, Duoqian Miao, Liang Hu, Cairong Zhao
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the VoxCeleb1 and CN-Celeb datasets show our proposed strategies can effectively improve the accuracy of speaker recognition by a margin of 10% to 15%, and our methods are very robust to different types of noise. |
| Researcher Affiliation | Collaboration | Yufeng Jin1, Guosheng Hu2, Haonan Chen3, Duoqian Miao1, Liang Hu1, Cairong Zhao1* 1 School of Electronic and Information Engineering, Tongji University, China 2 Oosto, UK 3 Alibaba Group, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing code for the work described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Our method is trained on the multimodal dataset VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and evaluated on the speech datasets VoxCeleb1 (Nagrani, Chung, and Zisserman 2017) and CN-Celeb (Fan et al. 2020). |
| Dataset Splits | No | The paper states that training is done on VoxCeleb2 and evaluation on VoxCeleb1 and CN-Celeb, but does not specify a separate validation split or its size/methodology. It focuses on train and test/evaluation splits. |
| Hardware Specification | Yes | The network is trained for 36 epochs on an Nvidia RTX 3090 GPU; it takes about 9 hours to train X-Vector and about 2 days to train ResNet34. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and models like ResNet34 and X-Vector, but does not specify software dependencies with version numbers such as Python, PyTorch, TensorFlow, or CUDA versions. |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 1e-3, decreasing by 25% every 3 epochs, and a weight decay of 5e-5. Each batch has 100 speakers, and each speaker has 2 audio utterances. The network is trained for 36 epochs on an Nvidia RTX 3090 GPU; it takes about 9 hours to train X-Vector and about 2 days to train ResNet34. |
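The experiment-setup row implies a step-decay learning-rate schedule and a speaker-balanced batch layout. A minimal sketch of those two details, assuming a standard step decay (the function and constant names below are ours, not the authors'):

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.25, step=3):
    """Step decay: the learning rate drops by 25% every 3 epochs,
    as described in the paper's experiment setup."""
    return initial_lr * (1 - decay) ** (epoch // step)

# Batch composition reported in the paper: 100 speakers, 2 utterances each.
SPEAKERS_PER_BATCH = 100
UTTERANCES_PER_SPEAKER = 2
batch_size = SPEAKERS_PER_BATCH * UTTERANCES_PER_SPEAKER  # 200 utterances

if __name__ == "__main__":
    # Inspect the schedule at the start, after the first decay, and at the end.
    for epoch in (0, 3, 35):
        print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.3e}")
```

In PyTorch this would correspond to `torch.optim.Adam(..., lr=1e-3, weight_decay=5e-5)` wrapped in `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.75)`, though the paper does not name the framework or scheduler used.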