SCOREQ: Speech Quality Assessment with Contrastive Regression

Authors: Alessandro Ragano, Jan Skoglund, Andrew Hines

NeurIPS 2024

Reproducibility assessment. Each variable is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM Response: "In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. ... (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ."
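The contrastive-regression idea described above can be illustrated with a label-ordered triplet loss: of two candidate samples, the one whose MOS label is closer to the anchor's acts as the positive. This is a minimal sketch under that assumption; the function name, margin value, and distance choice are illustrative, not the paper's exact SCOREQ formulation.

```python
import torch
import torch.nn.functional as F

def regression_triplet_loss(anchor, x1, x2, mos_anchor, mos_1, mos_2, margin=0.5):
    """Triplet loss for contrastive regression (illustrative sketch).

    anchor, x1, x2: embedding tensors of shape (batch, dim).
    mos_*: MOS labels of shape (batch,). The candidate whose label is
    closer to the anchor's label is treated as the positive.
    """
    d1 = (mos_1 - mos_anchor).abs()
    d2 = (mos_2 - mos_anchor).abs()
    swap = (d1 > d2).unsqueeze(-1)        # True where x2 is the positive
    pos = torch.where(swap, x2, x1)
    neg = torch.where(swap, x1, x2)
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    # Standard hinge: push the label-closer sample nearer in embedding space
    return F.relu(d_pos - d_neg + margin).mean()
```

When embedding distances already respect the MOS ordering by more than the margin, the loss is zero; otherwise it penalises the violation.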
Researcher Affiliation: Collaboration
LLM Response: Alessandro Ragano, School of Computer Science, University College Dublin, Dublin, Ireland (alessandro.ragano@ucd.ie); Jan Skoglund, Google LLC, San Francisco, CA (jks@google.com); Andrew Hines, School of Computer Science, University College Dublin, Dublin, Ireland (andrew.hines@ucd.ie)
Pseudocode: No
LLM Response: The paper describes the SCOREQ loss and training process using mathematical formulas and descriptive text, accompanied by figures (Figure 2, Figure 3, Figure 4), but it does not include a formal pseudocode block or an algorithm box labeled 'Algorithm'.
Open Source Code: Yes
LLM Response: The repository associated with this paper is available at https://github.com/alessandroragano/scoreq.
Open Datasets: Yes
LLM Response: NISQA TRAIN SIM: The NISQA TRAIN SIM dataset [43] is part of the NISQA Corpus and includes several degradations for telephone speech, combined and isolated. ... DNS Squim: This dataset was used to train the NR-PESQ and NR-SI-SDR models that are available in TorchAudio-Squim [30]. ... VoiceMOS Train: The training partition of the VoiceMOS challenge [22] includes several text-to-speech and voice conversion samples from the Blizzard and Voice Conversion challenges [10].
Dataset Splits: Yes
LLM Response: The NISQA VAL SIM dataset [43] is a partition made with the same conditions and source samples as NISQA TRAIN SIM. It consists of 2,500 speech samples from 938 different speakers. We use this dataset for early stopping in all experiments and to find the best combination of learning rate and batch size for the L2 loss model.
Hardware Specification: Yes
LLM Response: Every model is trained on an Nvidia Tesla V100 GPU.
Software Dependencies: No
LLM Response: We trained our models in PyTorch [48] and Torchaudio [62]. The wav2vec 2.0 model is taken from the fairseq toolkit [46]. The Adam optimizer with default PyTorch settings was used.
Experiment Setup: Yes
LLM Response: SCOREQ loss: Training is done by trimming all input files to 4 seconds, which is enough to capture quality and avoids using large memory. ... We fine-tune the transformer with a learning rate of 0.00001, while the embedding layer trained from scratch uses a learning rate of 0.001. We use the Adam optimizer with default PyTorch settings except for the learning rate. The batch size is set to 128 in all SCOREQ experiments. ... In all SCOREQ experiments, training is stopped if the Spearman correlation coefficient (SC) on the NISQA VAL SIM dataset does not improve for more than 100 epochs.
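The setup quoted above combines three mechanics that are easy to get wrong: trimming inputs to 4 seconds, per-group Adam learning rates (1e-5 for the pretrained transformer, 1e-3 for the scratch-trained embedding layer), and patience-based early stopping on the validation Spearman correlation. A minimal sketch follows; the module attribute names `transformer` and `embedding_head` and the 16 kHz sample rate are assumptions, not names from the paper's code.

```python
import torch

SAMPLE_RATE = 16000  # assumption: 16 kHz input, typical for wav2vec 2.0

def trim_4s(wav, sr=SAMPLE_RATE):
    """Trim a waveform tensor of shape (..., samples) to its first 4 seconds."""
    return wav[..., : 4 * sr]

def make_optimizer(model):
    """Adam with the per-group learning rates quoted in the review:
    1e-5 for the pretrained transformer, 1e-3 for the embedding layer
    trained from scratch. All other Adam settings stay at defaults."""
    return torch.optim.Adam([
        {"params": model.transformer.parameters(), "lr": 1e-5},
        {"params": model.embedding_head.parameters(), "lr": 1e-3},
    ])

class EarlyStopper:
    """Signal a stop once the validation Spearman correlation has failed
    to improve for more than `patience` consecutive epochs."""
    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_sc):
        if val_sc > self.best:
            self.best = val_sc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience  # True -> stop training
```

Passing parameter groups to `torch.optim.Adam` lets a single optimizer step apply different learning rates to the fine-tuned and scratch-trained parts of the network.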