Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng-Siong Chng, Chao Zhang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments evaluate the proposed STAR in various practical scenarios, including background noise, speaker accents, and specific scenarios (e.g., interviews and talks). Comprehensive results show significant gains from STAR, which enhances Whisper by an average of 13.5% relative word error rate (WER) reduction across 14 target domains. |
| Researcher Affiliation | Collaboration | Yuchen Hu¹, Chen Chen¹, Chao-Han Huck Yang², Chengwei Qin¹, Pin-Yu Chen³, Eng Siong Chng¹, Chao Zhang⁴ (¹Nanyang Technological University, ²NVIDIA Research, ³IBM Research, ⁴Tsinghua University) |
| Pseudocode | Yes | Algorithm 1: Self-Taught Recognizer (STAR) adaptation. (A hedged Python sketch of the core step follows the table.) |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/YUCHEN005/STAR-Adapt. |
| Open Datasets | Yes | All the data used in this paper are publicly available and under the following licenses: the Creative Commons BY-NC-ND 3.0 License, Creative Commons BY-NC-ND 4.0 License, Creative Commons BY-NC-SA 4.0 License, Creative Commons Attribution 4.0 International License, Creative Commons (CC0) License, the LDC User Agreement for Non-Members, the TED Terms of Use, the YouTube Terms of Service, and the BBC's Terms of Use. |
| Dataset Splits | Yes | We use its tr05-real split (9,600 utterances) as the target-domain unlabeled training data, as well as the test-real (1,320 utterances), test-simu (1,320 utterances), dev-real (1,640 utterances) and dev-simu (1,640 utterances) splits for testing. (These splits are collected into a small mapping after the table.) |
| Hardware Specification | Yes | This remarkable data efficiency significantly saves labour in real-world applications: not only is there no need for manual labelling, but the collection of unlabeled data also requires less than one hour. Adaptation costs around 0.8 hours of training time on a single NVIDIA A100-40GB GPU. |
| Software Dependencies | No | The paper mentions software like the 'Whisper-Large-V3 model' and 'Adam optimizer [45]' but does not provide specific version numbers for these or other software components used in the experiments. |
| Experiment Setup | Yes | We use the Whisper-Large-V3 model for main experiments, which contains 1.5 billion parameters trained on 680k-hour web-scale data. It is fine-tuned using the Adam optimizer [45] with an initial learning rate of 1e-5 for 2 epochs. The batch size is set to 1 with 16 gradient accumulation steps. For hyper-parameters, the threshold λ is set to 2 and the temperature τ is 10. In addition, the percentile α of utterance-level filtering is 20, which shows consistent effectiveness across different datasets. (A configuration sketch using these values follows the table.) |
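
Algorithm 1 itself is not reproduced on this page, but its core step, token-weighted training on self-generated pseudo labels, can be illustrated compactly. The sketch below is a minimal stand-in rather than the paper's implementation: the random `quality` tensor replaces STAR's combined confidence/attentive indicator, the weighting function is illustrative, and the utterance-level filtering step (α = 20th percentile) is omitted; only λ = 2 and τ = 10 are taken from the experiment-setup row.

```python
# Minimal sketch of STAR-style token-weighted pseudo-label training on dummy
# tensors. `quality` is a random stand-in for the paper's combined
# confidence/attentive indicator; the weighting function is illustrative.
# Only lam (λ = 2) and tau (τ = 10) come from the paper's experiment setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 50, 12
logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)  # decoder-output stand-in
pseudo = torch.randint(0, vocab_size, (1, seq_len))               # pseudo-label stand-in
quality = torch.rand(1, seq_len)                                  # token quality in [0, 1]

lam, tau = 2.0, 10.0
# Temperature-sharpened weights, normalized to mean ~1 over the utterance and
# capped at lam so no single token dominates the loss.
weights = torch.clamp(torch.softmax(tau * quality, dim=-1) * seq_len, max=lam)

# Per-token cross-entropy against the pseudo labels, reweighted token by token.
ce = F.cross_entropy(logits.transpose(1, 2), pseudo, reduction="none")  # (1, seq_len)
loss = (weights * ce).mean()
loss.backward()
print(f"weighted pseudo-label loss: {loss.item():.4f}")
```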
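
The CHiME-4 splits quoted in the Dataset Splits row can be collected into a small mapping for an evaluation script; the dict layout is a hypothetical convenience, with utterance counts taken from the quote.

```python
# CHiME-4 splits and utterance counts as quoted above; the dict layout itself
# is a hypothetical convenience, not from the paper's codebase.
CHIME4_SPLITS = {
    "tr05-real": {"utterances": 9600, "role": "target-domain unlabeled training"},
    "test-real": {"utterances": 1320, "role": "test"},
    "test-simu": {"utterances": 1320, "role": "test"},
    "dev-real":  {"utterances": 1640, "role": "test"},
    "dev-simu":  {"utterances": 1640, "role": "test"},
}
```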
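
Finally, the Experiment Setup row pins down the optimization recipe completely: Adam, learning rate 1e-5, 2 epochs, and batch size 1 with 16 gradient-accumulation steps. A minimal loop wiring those stated values together might look as follows, with `model`, `adaptation_set`, and `star_loss` as hypothetical placeholders.

```python
# Fine-tuning loop with the paper's stated hyper-parameters. `model`,
# `adaptation_set`, and `star_loss` are hypothetical placeholders; the
# optimizer settings and accumulation schedule follow the quote above.
import torch

def finetune(model, adaptation_set, star_loss,
             lr=1e-5, epochs=2, accum_steps=16):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, (audio, pseudo, weights) in enumerate(adaptation_set, start=1):
            # Scale each per-utterance loss so the accumulated gradient
            # matches an effective batch of 1 × 16.
            loss = star_loss(model, audio, pseudo, weights) / accum_steps
            loss.backward()
            if step % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```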