SCOREQ: Speech Quality Assessment with Contrastive Regression

Authors: Alessandro Ragano, Jan Skoglund, Andrew Hines

NeurIPS 2024

Reproducibility assessment. Each variable is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM Response: "In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. ... (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ."
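The contrastive-regression idea described above can be illustrated with a label-ordered triplet loss: of two candidate samples, the one whose MOS label is closer to the anchor's acts as the positive. This is a minimal sketch under that assumption; the function name, margin value, and distance choice are illustrative, not the paper's exact SCOREQ formulation.

```python
import torch
import torch.nn.functional as F

def regression_triplet_loss(anchor, x1, x2, mos_anchor, mos_1, mos_2, margin=0.5):
    """Triplet loss for contrastive regression (illustrative sketch).

    anchor, x1, x2: embedding tensors of shape (batch, dim).
    mos_*: MOS labels of shape (batch,). The candidate whose label is
    closer to the anchor's label is treated as the positive.
    """
    d1 = (mos_1 - mos_anchor).abs()
    d2 = (mos_2 - mos_anchor).abs()
    swap = (d1 > d2).unsqueeze(-1)        # True where x2 is the positive
    pos = torch.where(swap, x2, x1)
    neg = torch.where(swap, x1, x2)
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    # Standard hinge: push the label-closer sample nearer in embedding space
    return F.relu(d_pos - d_neg + margin).mean()
```

When embedding distances already respect the MOS ordering by more than the margin, the loss is zero; otherwise it penalises the violation.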
Researcher Affiliation: Collaboration
LLM Response: Alessandro Ragano, School of Computer Science, University College Dublin, Dublin, Ireland (alessandro.ragano@ucd.ie); Jan Skoglund, Google LLC, San Francisco, CA (jks@google.com); Andrew Hines, School of Computer Science, University College Dublin, Dublin, Ireland (andrew.hines@ucd.ie)
Pseudocode: No
LLM Response: The paper describes the SCOREQ loss and training process using mathematical formulas and descriptive text, accompanied by figures (Figure 2, Figure 3, Figure 4), but it does not include a formal pseudocode block or an algorithm box labeled 'Algorithm'.
Open Source Code: Yes
LLM Response: The repository associated with this paper is available at https://github.com/alessandroragano/scoreq.
Open Datasets: Yes
LLM Response: NISQA TRAIN SIM: The NISQA TRAIN SIM dataset [43] is part of the NISQA Corpus and includes several degradations for telephone speech, combined and isolated. ... DNS Squim: This dataset was used to train the NR-PESQ and NR-SI-SDR models that are available in TorchAudio-Squim [30]. ... VoiceMOS Train: The training partition of the VoiceMOS challenge [22] includes several text-to-speech and voice conversion samples from the Blizzard and Voice Conversion challenges [10].
Dataset Splits: Yes
LLM Response: The NISQA VAL SIM dataset [43] is a partition made with the same conditions and source samples as NISQA TRAIN SIM. It consists of 2,500 speech samples from 938 different speakers. We use this dataset for early stopping in all experiments and to find the best combination of learning rate and batch size for the L2 loss model.
Hardware Specification: Yes
LLM Response: Every model is trained on an Nvidia Tesla V100 GPU.
Software Dependencies: No
LLM Response: We trained our models in PyTorch [48] and Torchaudio [62]. The wav2vec 2.0 model is taken from the fairseq toolkit [46]. The Adam optimizer with default PyTorch settings was used.
Experiment Setup: Yes
LLM Response: SCOREQ loss: Training is done by trimming all input files to 4 seconds, which is enough to capture quality and avoids using large memory. ... We fine-tune the transformer with a learning rate of 0.00001, while the embedding layer trained from scratch uses a learning rate of 0.001. We use the Adam optimizer with default PyTorch settings except for the learning rate. The batch size is set to 128 in all SCOREQ experiments. ... In all SCOREQ experiments, training is stopped if the Spearman correlation coefficient (SC) on the NISQA VAL SIM dataset does not improve for more than 100 epochs.
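The setup quoted above combines three mechanics that are easy to get wrong: trimming inputs to 4 seconds, per-group Adam learning rates (1e-5 for the pretrained transformer, 1e-3 for the scratch-trained embedding layer), and patience-based early stopping on the validation Spearman correlation. A minimal sketch follows; the module attribute names `transformer` and `embedding_head` and the 16 kHz sample rate are assumptions, not names from the paper's code.

```python
import torch

SAMPLE_RATE = 16000  # assumption: 16 kHz input, typical for wav2vec 2.0

def trim_4s(wav, sr=SAMPLE_RATE):
    """Trim a waveform tensor of shape (..., samples) to its first 4 seconds."""
    return wav[..., : 4 * sr]

def make_optimizer(model):
    """Adam with the per-group learning rates quoted in the review:
    1e-5 for the pretrained transformer, 1e-3 for the embedding layer
    trained from scratch. All other Adam settings stay at defaults."""
    return torch.optim.Adam([
        {"params": model.transformer.parameters(), "lr": 1e-5},
        {"params": model.embedding_head.parameters(), "lr": 1e-3},
    ])

class EarlyStopper:
    """Signal a stop once the validation Spearman correlation has failed
    to improve for more than `patience` consecutive epochs."""
    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_sc):
        if val_sc > self.best:
            self.best = val_sc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience  # True -> stop training
```

Passing parameter groups to `torch.optim.Adam` lets a single optimizer step apply different learning rates to the fine-tuned and scratch-trained parts of the network.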