Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer
Authors: Jianwei Zhang, Suren Jayasuriya, Visar Berisha
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice. The experimental results demonstrate that adding an ICC regularizer can improve the repeatability of learned embeddings compared to using only the contrastive loss; further, these embeddings lead to improved performance in these downstream tasks. (See the ICC sketch after this table.) |
| Researcher Affiliation | Academia | Jianwei Zhang, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, jianwei.zhang@asu.edu; Suren Jayasuriya, School of Arts, Media and Engineering / School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, sjayasur@asu.edu; Visar Berisha, College of Health Solutions / School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, visar@asu.edu |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | For reproducibility of our work, the code for the ICC regularizer and experiments is available open-source in our GitHub repository: https://github.com/vigor-jzhang/icc-regularizer/ |
| Open Datasets | Yes | We use the VoxCeleb1 & 2 development datasets for training, and the VoxCeleb1 testing dataset for TI-SV performance evaluation [37, 11]. We use the Saarbruecken Voice Database (SVD) [57] as the training and in-corpus validation dataset for dysphonic voice detection, and the mPower corpus [5] is used only for improving the repeatability of voice feature embeddings. |
| Dataset Splits | Yes | We use the VoxCeleb1 & 2 development datasets for training, and the VoxCeleb1 testing dataset for TI-SV performance evaluation [37, 11]. For training VGG-M-40, each batch contains N = 8 speakers and M = 30 utterances per speaker, and the loss formula is $L(e) = 1.0\,L_{\text{contr}}(e) + 0.06\,R_{\text{ICC}}(e)$... The hyperparameter is tuned on the development dataset. The EER for subjects in the development dataset is used to determine the optimized hyperparameter. We use the Saarbruecken Voice Database (SVD) [57] as the training and in-corpus validation dataset for dysphonic voice detection. We perform cross-validation six times to characterize the variability in performance. |
| Hardware Specification | Yes | We use one NVIDIA Titan Xp graphics card to train our models. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, SVM, and Wav2Vec2, but does not specify version numbers for them or for any other software dependencies. |
| Experiment Setup | Yes | For training VGG-M-40, each batch contains N = 8 speakers and M = 30 utterances per speaker, and the loss formula is $L(e) = 1.0\,L_{\text{contr}}(e) + 0.06\,R_{\text{ICC}}(e)$, where $e$ is the embeddings of the speakers in one batch. For training Fast ResNet-34, each batch contains N = 100 speakers and M = 2 utterances per speaker, and the loss formula is $L(e) = 1.0\,L_{\text{contr}}(e) + 0.25\,R_{\text{ICC}}(e)$. During training, we use the Adam optimizer, maintaining a static learning rate of 0.001 without implementing any learning rate schedule. The dropout rate is set to 0.2 for all dropout layers. Training loss: We use the following formula for training: $L = 0.5\,R_{\text{ICC}} + 1.0\,L_{\text{contr}} + 1.0\,L_{\text{class}}$, where $R_{\text{ICC}}$ is the ICC regularizer on repeat-constrained embeddings, $L_{\text{contr}}$ is the contrastive loss on the voice feature embeddings, and $L_{\text{class}}$ is the classification loss. The SGD optimizer is used with a learning rate of 0.001 and other default settings. We train the model for 20k steps, which takes approximately 16 hours under our configurations. (A hedged sketch of this training recipe follows the table.) |
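
Below is a minimal PyTorch sketch of how an intra-class correlation regularizer can be computed over batched speaker embeddings. Everything here is a reconstruction under assumptions: the `icc_regularizer` name, the one-way ICC(1) formulation, the `[N, M, D]` batch layout, and the `1 - ICC` sign convention are ours, not the authors'; the paper's exact formulation lives in the repository linked above.

```python
import torch

def icc_regularizer(emb: torch.Tensor) -> torch.Tensor:
    """One-way ICC(1) over embeddings shaped [N, M, D]:
    N speakers, M utterances per speaker, D embedding dimensions.
    Returns a loss that is small when repeatability (ICC) is high."""
    N, M, _ = emb.shape
    speaker_mean = emb.mean(dim=1, keepdim=True)       # [N, 1, D]
    grand_mean = emb.mean(dim=(0, 1), keepdim=True)    # [1, 1, D]
    # Between- and within-speaker mean squares, per embedding dimension.
    ms_between = M * ((speaker_mean - grand_mean) ** 2).sum(dim=(0, 1)) / (N - 1)
    ms_within = ((emb - speaker_mean) ** 2).sum(dim=(0, 1)) / (N * (M - 1))
    icc = (ms_between - ms_within) / (ms_between + (M - 1) * ms_within + 1e-8)
    # Sign convention is an assumption: minimizing (1 - mean ICC)
    # pushes the embeddings toward higher repeatability.
    return 1.0 - icc.mean()
```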
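And a companion sketch of the VGG-M-40 recipe quoted in the Experiment Setup row (batches of N = 8 speakers × M = 30 utterances, Adam at a fixed learning rate of 0.001, loss weights 1.0 and 0.06), reusing `icc_regularizer` from above. The linear encoder and `contrastive_loss` below are hypothetical placeholders, not the paper's actual network or loss.

```python
import torch

N, M, T, D = 8, 30, 16000, 256   # speakers, utterances, input dim, emb dim

# Placeholder encoder; the real VGG-M-40 is in the authors' repository.
model = torch.nn.Linear(T, D)

def contrastive_loss(emb: torch.Tensor) -> torch.Tensor:
    # Placeholder: pull each utterance toward its speaker centroid via
    # cosine similarity (not the paper's exact contrastive loss).
    centroids = emb.mean(dim=1, keepdim=True)                    # [N, 1, D]
    sim = torch.nn.functional.cosine_similarity(emb, centroids, dim=-1)
    return (1.0 - sim).mean()

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_step(batch: torch.Tensor) -> float:   # batch: [N, M, T] features
    emb = model(batch)                          # -> [N, M, D] embeddings
    # Quoted recipe: L(e) = 1.0 * L_contr(e) + 0.06 * R_ICC(e)
    loss = 1.0 * contrastive_loss(emb) + 0.06 * icc_regularizer(emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(N, M, T)))
```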