Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer

Authors: Jianwei Zhang, Suren Jayasuriya, Visar Berisha

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement the ICC regularizer and apply it to three speech tasks: speaker verification, voice style conversion, and a clinical application for detecting dysphonic voice. The experimental results demonstrate that adding an ICC regularizer can improve the repeatability of learned embeddings compared to only using the contrastive loss; further, these embeddings lead to improved performance in these downstream tasks.
Researcher Affiliation | Academia | Jianwei Zhang, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, jianwei.zhang@asu.edu; Suren Jayasuriya, School of Arts, Media and Engineering and School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, sjayasur@asu.edu; Visar Berisha, College of Health Solutions and School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, visar@asu.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | For reproducibility of our work, the code for the ICC regularizer and experiments is available open-source in our GitHub repository: https://github.com/vigor-jzhang/icc-regularizer/
Open Datasets | Yes | We use the VoxCeleb 1 & 2 development dataset for training, and the VoxCeleb 1 testing dataset for TI-SV performance evaluation [37, 11]. We use the Saarbruecken Voice Database (SVD) [57] as the training and in-corpus validation dataset for dysphonic voice detection, and the mPower corpus [5] is used only for improving the repeatability of voice feature embeddings.
Dataset Splits | Yes | We use the VoxCeleb 1 & 2 development dataset for training, and the VoxCeleb 1 testing dataset for TI-SV performance evaluation [37, 11]. For training VGG-M-40, each batch contains N = 8 speakers and M = 30 utterances per speaker, and the loss formula is L(e) = 1.0 L_contr(e) + 0.06 R_ICC(e)... The hyperparameter is tuned on the development dataset. The EER for subjects in the development dataset is used to determine the optimized hyperparameter. We use the Saarbruecken Voice Database (SVD) [57] as the training and in-corpus validation dataset for dysphonic voice detection. We perform cross-validation six times to characterize the variability in performance.
Hardware Specification | Yes | We use one NVIDIA Titan Xp graphics card to train our models.
Software Dependencies | No | The paper mentions software tools and optimizers such as 'Adam optimizer', 'SVM', and 'Wav2Vec2', but does not specify their version numbers or other ancillary software dependencies with versions.
Experiment Setup | Yes | For training VGG-M-40, each batch contains N = 8 speakers and M = 30 utterances per speaker, and the loss formula is L(e) = 1.0 L_contr(e) + 0.06 R_ICC(e), where e is the embeddings of the speakers in one batch. For training Fast ResNet-34, each batch contains N = 100 speakers and M = 2 utterances per speaker, and the loss formula is L(e) = 1.0 L_contr(e) + 0.25 R_ICC(e). During training, we use the Adam optimizer, maintaining a static learning rate of 0.001 without implementing any learning rate schedule. The dropout rate is set to 0.2 for all dropout layers. Training loss: We use the following formula for training: L = 0.5 R_ICC + 1.0 L_contr + 1.0 L_class, where R_ICC is the ICC regularizer on repeat-constrained embeddings, L_contr is the contrastive loss on the voice feature embeddings, and L_class is the classification loss. The SGD optimizer is used with a learning rate of 0.001 and other default settings. We train the model for 20k steps, which takes approximately 16 hours under our configurations.
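
To make the Research Type row concrete, the following is a minimal sketch of what an ICC-based repeatability regularizer on a batch of speaker embeddings can look like. It assumes a one-way ICC(1) formulation computed per embedding dimension and a `1 - mean ICC` penalty; the function name `icc_regularizer`, the embedding dimension of 256, and the exact ICC variant are assumptions for illustration, not the authors' implementation (that lives in their GitHub repository linked above).

```python
import torch


def icc_regularizer(embeddings: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical ICC(1)-style penalty on a batch of embeddings.

    embeddings: tensor of shape (N, M, D) -- N speakers, M utterances per
    speaker, D embedding dimensions. Returns 1 - (mean per-dimension ICC),
    so minimizing the penalty encourages repeatable (high-ICC) embeddings.
    """
    N, M, _ = embeddings.shape
    grand_mean = embeddings.mean(dim=(0, 1), keepdim=True)    # (1, 1, D)
    speaker_mean = embeddings.mean(dim=1, keepdim=True)       # (N, 1, D)

    # One-way ANOVA mean squares per embedding dimension.
    ms_between = M * ((speaker_mean - grand_mean) ** 2).sum(dim=(0, 1)) / (N - 1)
    ms_within = ((embeddings - speaker_mean) ** 2).sum(dim=(0, 1)) / (N * (M - 1))

    icc = (ms_between - ms_within) / (ms_between + (M - 1) * ms_within + eps)
    return 1.0 - icc.mean()


# Example with the VGG-M-40 batch shape quoted above: N = 8 speakers, M = 30 utterances.
penalty = icc_regularizer(torch.randn(8, 30, 256))
```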
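Building on that, the Experiment Setup row quotes the combined objective L(e) = 1.0 L_contr(e) + 0.06 R_ICC(e) with Adam at a fixed learning rate of 0.001 for VGG-M-40. The sketch below shows one training step under those quoted weights, reusing `icc_regularizer` from the previous sketch. The encoder, the 40-dimensional input features, and `toy_contrastive_loss` are placeholders, not the paper's VGG-M-40 or its contrastive loss.

```python
import torch
import torch.nn.functional as F


def toy_contrastive_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """Stand-in contrastive objective (the paper's own loss is in its repo).

    Prototypical-style: each utterance embedding should be most similar
    (by cosine similarity) to the centroid of its own speaker.
    """
    N, M, _ = embeddings.shape
    emb = F.normalize(embeddings, dim=-1)
    centroids = F.normalize(emb.mean(dim=1), dim=-1)            # (N, D)
    sims = torch.einsum("nmd,kd->nmk", emb, centroids)          # (N, M, N)
    target = torch.arange(N).unsqueeze(1).expand(N, M)          # own-speaker index
    return F.cross_entropy(sims.reshape(N * M, N), target.reshape(N * M))


# Quoted VGG-M-40 setup: L(e) = 1.0 L_contr(e) + 0.06 R_ICC(e),
# Adam optimizer, static learning rate 0.001, N = 8 speakers, M = 30 utterances.
encoder = torch.nn.Sequential(torch.nn.Linear(40, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 256))        # placeholder encoder
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

features = torch.randn(8, 30, 40)                               # fake input features
embeddings = encoder(features)                                  # (8, 30, 256)
loss = 1.0 * toy_contrastive_loss(embeddings) + 0.06 * icc_regularizer(embeddings)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```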
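The Dataset Splits row notes that the regularizer weight is tuned using the EER of subjects in the development dataset. The quoted excerpts do not show how the EER is computed, so the snippet below is only a standard illustration of computing EER from verification trial scores with scikit-learn's `roc_curve`; the function name `equal_error_rate` is ours, not the paper's.

```python
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from trial scores (higher score = more likely same speaker).

    labels: 1 for genuine (same-speaker) trials, 0 for impostor trials.
    The EER is the operating point where the false-acceptance rate and the
    false-rejection rate are (approximately) equal.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```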