Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, Eng Siong Chng
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate while with limited training data. We conduct experiments on the proposed RobustHP dataset, which is detailed in Appendix A. |
| Researcher Affiliation | Collaboration | Yuchen Hu (1), Chen Chen (1), Chao-Han Huck Yang (2,3), Ruizhe Li (4), Chao Zhang (5), Pin-Yu Chen (6), Eng Siong Chng (1); affiliations: 1 Nanyang Technological University, 2 Georgia Institute of Technology, 3 NVIDIA Research, 4 University of Aberdeen, 5 Tsinghua University, 6 IBM Research |
| Pseudocode | Yes | Algorithm 1: Audio noise distillation via mutual information neural estimation (MINE). (A generic MINE sketch is provided below the table.) |
| Open Source Code | Yes | This work is open-sourced at: https://github.com/YUCHEN005/RobustGER |
| Open Datasets | Yes | Correspondingly, we develop a Robust HyPoradise dataset by collecting hypotheses-transcription (HT) pairs from common noisy ASR corpora, including CHiME-4 (Vincent et al., 2016), VoiceBank-DEMAND (Valentini-Botinhao et al., 2016), NOIZEUS (Hu & Loizou, 2006), LibriSpeech-FreeSound (Prasad et al., 2021) and RATS (Graff et al., 2014), with details provided in Appendix A. |
| Dataset Splits | Yes | CHiME-4 (Vincent et al., 2016): ... We use its tr05-real split (9,600 utterances) to generate RobustHP training data, as well as the test-real (1,320 utterances), test-simu (1,320 utterances), dev-real (1,640 utterances) and dev-simu (1,640 utterances) splits to generate the test data. |
| Hardware Specification | Yes | We use 1 NVIDIA A40 GPU for model training, which takes 1.5 hours for CHiME-4, 2.0 hours for VB-DEMAND, 1.6 hours for NOIZEUS, 4.5 hours for LS-FreeSound, and 3.8 hours for RATS, respectively. |
| Software Dependencies | No | The paper mentions using 'Whisper Large-V2', 'LLaMA-Adapter', 'sentence-BERT', 'FastText', and 'ESPnet toolkit'. While these are specific software components, no version numbers are provided for them or for underlying programming languages/frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The learning rate is set to 10^-2 for CHiME-4, which is relatively small, and to 5×10^-3 for the relatively large datasets VB-DEMAND, NOIZEUS, LS-FreeSound and RATS. The batch size is set to 4, with accumulation iterations set to 8 (i.e., an effective batch size of 32). We train for 2 epochs with the AdamW optimizer (Loshchilov & Hutter, 2018), with weight decay set to 0.02 and warmup steps set to 20% of one epoch's steps. In addition, MINE is updated using an extra AdamW optimizer with a learning rate that is 10% of that used for LLM tuning, with all other configurations kept the same. The hyper-parameter λ in Algorithm 1 is set to 0.5. (A hedged configuration sketch based on these values follows the table.) |
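
The Experiment Setup row fully specifies the reported optimizer hyperparameters. Below is a minimal PyTorch sketch of that configuration, assuming a standard gradient-accumulation training loop; the `adapter` and `mine_net` modules are tiny placeholders for the tuned LLaMA-Adapter parameters and the MINE statistics network, and the linear warmup shape is an assumption, not the authors' implementation.

```python
# Hedged sketch of the reported optimizer setup (not the authors' code).
import torch
import torch.nn as nn

# Placeholder modules standing in for the real LLaMA-Adapter parameters
# and the MINE statistics network.
adapter = nn.Linear(4096, 4096)
mine_net = nn.Linear(2048, 1)

LR_LLM = 1e-2                   # CHiME-4; 5e-3 for VB-DEMAND, NOIZEUS, LS-FreeSound, RATS
WEIGHT_DECAY = 0.02
BATCH_SIZE, ACCUM_ITERS = 4, 8  # effective batch size 4 * 8 = 32
EPOCHS = 2
LAMBDA = 0.5                    # hyper-parameter of Algorithm 1

opt_llm = torch.optim.AdamW(adapter.parameters(), lr=LR_LLM, weight_decay=WEIGHT_DECAY)
# MINE is updated by a separate AdamW at 10% of the LLM learning rate,
# with all other settings kept the same.
opt_mine = torch.optim.AdamW(mine_net.parameters(), lr=0.1 * LR_LLM, weight_decay=WEIGHT_DECAY)

# Warmup over 20% of one epoch's optimizer steps; CHiME-4 tr05-real has 9,600 utterances.
steps_per_epoch = 9600 // (BATCH_SIZE * ACCUM_ITERS)
warmup_steps = int(0.2 * steps_per_epoch)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    opt_llm, lambda step: min(1.0, (step + 1) / max(1, warmup_steps))
)
```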
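
For the Pseudocode row, Algorithm 1 performs audio noise distillation via MINE. The paper's exact algorithm is not reproduced here; the following is a generic sketch of the Donsker-Varadhan lower bound that a MINE estimator optimizes, with the statistics-network architecture and input dimensions chosen arbitrarily for illustration.

```python
# Generic MINE lower-bound sketch (Donsker-Varadhan), not the paper's Algorithm 1.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta(x, y): scores pairs, trained to separate joint from marginal samples."""
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mine_lower_bound(T: StatisticsNetwork, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """I(X; Y) >= E_p(x,y)[T(x, y)] - log E_p(x)p(y)[exp(T(x, y'))].

    Marginal samples are approximated by shuffling y within the batch.
    """
    joint_term = T(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]
    marginal_term = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint_term - marginal_term
```

Maximizing this bound with respect to T yields the mutual information estimate; per the Experiment Setup row, Algorithm 1 weights its MINE term with λ = 0.5, though how that term enters the LLM tuning objective is not detailed in this table.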