CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
Authors: Yen-Ju Lu, Jing Liu, Thomas Thebaud, Laureano Moro-Velazquez, Ariya Rastrow, Najim Dehak, Jesus Villalba
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a 10% relative reduction in LID errors, a 37% improvement in ASR CER on the ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb1, demonstrating its effectiveness. |
| Researcher Affiliation | Academia | Center for Language and Speech Processing, Johns Hopkins University {ylu125, tthebau1, laureano, ndehak3, jvillal7}@jhu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, pre-trained models, and reproduction instructions for the experiments are available at https://github.com/neillu23/CA-SSLR. |
| Open Datasets | Yes | We use the ML-SUPERB benchmark [Shi et al., 2023a] for both LID and ASR experiments. ML-SUPERB provides two data configurations: (i) 10-minute/language and (ii) 1-hour/language, covering 123 well-represented languages (Normal)... As ML-SUPERB lacks speaker labels, we incorporate VoxCeleb2 [Nagrani et al., 2017] to train the models on the SV task. VoxCeleb2 contains 1,092 hours of speech from 5,994 speakers... |
| Dataset Splits | Yes | The 10-minute training set encompasses 37.4 hours of data, and the 1-hour dataset increases the total to 222.4 hours of data. Additionally, the dataset includes development and testing sets, containing 41.8 hours and 45.0 hours of data, respectively. (from A.3.1 ML-SUPERB Dataset). And for VoxCeleb2: it contains 1,092 hours of audio from 5,994 speakers for training, 110 hours from 4,933 speakers for development, and 20 hours from 40 speakers designated for testing. (from A.3.2 VoxCeleb Dataset). |
| Hardware Specification | Yes | Training a single model requires about one day on 2 A100 GPUs. |
| Software Dependencies | No | We conduct experiments using S3PRL [Yang et al., 2021] and ESPnet [Watanabe et al., 2018]. (The toolkits are named, but no specific version numbers or dependency lists are provided.) |
| Experiment Setup | Yes | Detailed information on the remaining hyperparameters is provided in the appendix. Table 5: Hyper-parameters used for training ASR, LID, and SV decoder models. Table 6: Training hyper-parameters for CA-SSLR models. |