Cross-Modal Distillation for Speaker Recognition
Authors: Yufeng Jin, Guosheng Hu, Haonan Chen, Duoqian Miao, Liang Hu, Cairong Zhao
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the VoxCeleb1 and CN-Celeb datasets show our proposed strategies can effectively improve the accuracy of speaker recognition by a margin of 10% to 15%, and our methods are very robust to different types of noise. |
| Researcher Affiliation | Collaboration | Yufeng Jin1, Guosheng Hu2, Haonan Chen3, Duoqian Miao1, Liang Hu1, Cairong Zhao1* 1 School of Electronic and Information Engineering, Tongji University, China 2 Oosto, UK 3 Alibaba Group, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing code for the work described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Our method is trained on the multimodal dataset VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and evaluated on the speech datasets VoxCeleb1 (Nagrani, Chung, and Zisserman 2017) and CN-Celeb (Fan et al. 2020). |
| Dataset Splits | No | The paper states that training is done on VoxCeleb2 and evaluation on VoxCeleb1 and CN-Celeb, but does not specify a separate validation split or its size/methodology. It focuses on train and test/evaluation splits. |
| Hardware Specification | Yes | The network is trained for 36 epochs on an Nvidia RTX 3090 GPU; it takes about 9 hours to train X-Vector and about 2 days to train ResNet34. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and models like ResNet34 and X-Vector, but does not specify software dependencies with version numbers such as Python, PyTorch, TensorFlow, or CUDA versions. |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 1e-3, decreasing by 25% every 3 epochs, and a weight decay of 5e-5. Each batch has 100 speakers, and each speaker has 2 audio utterances. The network is trained for 36 epochs on an Nvidia RTX 3090 GPU; it takes about 9 hours to train X-Vector and about 2 days to train ResNet34. |
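The experiment-setup row implies a step-decay learning-rate schedule and a speaker-balanced batch layout. A minimal sketch of those two details, assuming a standard step decay (the function and constant names below are ours, not the authors'):

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.25, step=3):
    """Step decay: the learning rate drops by 25% every 3 epochs,
    as described in the paper's experiment setup."""
    return initial_lr * (1 - decay) ** (epoch // step)

# Batch composition reported in the paper: 100 speakers, 2 utterances each.
SPEAKERS_PER_BATCH = 100
UTTERANCES_PER_SPEAKER = 2
batch_size = SPEAKERS_PER_BATCH * UTTERANCES_PER_SPEAKER  # 200 utterances

if __name__ == "__main__":
    # Inspect the schedule at the start, after the first decay, and at the end.
    for epoch in (0, 3, 35):
        print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.3e}")
```

In PyTorch this would correspond to `torch.optim.Adam(..., lr=1e-3, weight_decay=5e-5)` wrapped in `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.75)`, though the paper does not name the framework or scheduler used.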