CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
Authors: Yen-Ju Lu, Jing Liu, Thomas Thebaud, Laureano Moro-Velazquez, Ariya Rastrow, Najim Dehak, Jesus Villalba
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in LID errors, a 37% improvement in ASR CER on the ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb1, demonstrating its effectiveness. |
| Researcher Affiliation | Academia | Center for Language and Speech Processing, Johns Hopkins University {ylu125, tthebau1, laureano, ndehak3, jvillal7}@jhu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, pre-trained models, and reproduction instructions for the experiments are available at https://github.com/neillu23/CA-SSLR. |
| Open Datasets | Yes | We use the ML-SUPERB benchmark [Shi et al., 2023a] for both LID and ASR experiments. ML-SUPERB provides two data configurations: (i) 10-minute/language and (ii) 1-hour/language, covering 123 well-represented languages (Normal)... As ML-SUPERB lacks speaker labels, we incorporate VoxCeleb2 [Nagrani et al., 2017] to train the models on the SV task. VoxCeleb2 contains 1,092 hours of speech from 5,994 speakers... |
| Dataset Splits | Yes | The 10-minute training set encompasses 37.4 hours of data, and the 1-hour dataset increases the total to 222.4 hours of data. Additionally, the dataset includes development and testing sets, containing 41.8 hours and 45.0 hours of data, respectively. (from A.3.1 ML-SUPERB Dataset). And for VoxCeleb2: it contains 1,092 hours of audio from 5,994 speakers for training, 110 hours from 4,933 speakers for development, and 20 hours from 40 speakers designated for testing. (from A.3.2 VoxCeleb Dataset). |
| Hardware Specification | Yes | Training a single model requires about one day on 2 A100 GPUs. |
| Software Dependencies | No | We conduct experiments using S3PRL [Yang et al., 2021] and ESPnet [Watanabe et al., 2018]. (The toolkits are named, but no specific version numbers or dependency lists are provided.) |
| Experiment Setup | Yes | Detailed information on the remaining hyperparameters is provided in the appendix. Table 5: Hyper-parameters used for training ASR, LID, and SV decoder models. Table 6: Training hyper-parameters for CA-SSLR models. |