Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, Eng Siong Chng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough of up to 53.9% correction improvement in terms of word error rate with limited training data. We conduct experiments on the proposed RobustHP dataset, which is detailed in Appendix A.
Researcher Affiliation | Collaboration | Yuchen Hu (1), Chen Chen (1), Chao-Han Huck Yang (2,3), Ruizhe Li (4), Chao Zhang (5), Pin-Yu Chen (6), Eng Siong Chng (1); affiliations: (1) Nanyang Technological University, (2) Georgia Institute of Technology, (3) NVIDIA Research, (4) University of Aberdeen, (5) Tsinghua University, (6) IBM Research
Pseudocode | Yes | Algorithm 1: Audio noise distillation via mutual information neural estimation (MINE). A hedged sketch of the MINE bound is given after the table.
Open Source Code | Yes | This work is open sourced at: https://github.com/YUCHEN005/RobustGER
Open Datasets | Yes | Correspondingly, we develop a Robust HyPoradise (RobustHP) dataset by collecting hypotheses-transcription (HT) pairs from common noisy ASR corpora, including CHiME-4 (Vincent et al., 2016), VoiceBank-DEMAND (Valentini-Botinhao et al., 2016), NOIZEUS (Hu & Loizou, 2006), LibriSpeech-FreeSound (Prasad et al., 2021) and RATS (Graff et al., 2014), with details provided in Appendix A.
Dataset Splits | Yes | CHiME-4 (Vincent et al., 2016): ... We use its tr05-real split (9,600 utterances) to generate RobustHP training data, as well as the test-real (1,320 utterances), test-simu (1,320 utterances), dev-real (1,640 utterances) and dev-simu (1,640 utterances) splits to generate the test data. These split sizes are summarized in a short snippet after the table.
Hardware Specification | Yes | We use 1 NVIDIA A40 GPU for model training, which takes 1.5 hours for CHiME-4, 2.0 hours for VB-DEMAND, 1.6 hours for NOIZEUS, 4.5 hours for LS-FreeSound, and 3.8 hours for RATS, respectively.
Software Dependencies | No | The paper mentions using 'Whisper Large-V2', 'LLaMA-Adapter', 'sentence-BERT', 'FastText', and the 'ESPnet toolkit'. While these are specific software components, no version numbers are provided for them or for underlying programming languages/frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The learning rate is set to 10⁻² for CHiME-4, which is relatively small, and to 5×10⁻³ for the relatively large datasets VB-DEMAND, NOIZEUS, LS-FreeSound and RATS. The batch size is set to 4, with accumulation iterations set to 8 (i.e., an effective batch size of 32). We train for 2 epochs with the AdamW optimizer (Loshchilov & Hutter, 2018), with weight decay set to 0.02 and warmup steps set to 20% of one epoch's steps. In addition, MINE is updated using an extra AdamW optimizer with a learning rate that is 10% of the LLM-tuning rate, with all other configurations kept the same. The hyper-parameter λ in Algorithm 1 is set to 0.5. These settings are collected in the configuration sketch after the table.
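The Pseudocode row above refers to Algorithm 1, audio noise distillation via mutual information neural estimation (MINE). Since the report does not reproduce the algorithm itself, the following is a minimal PyTorch sketch of the standard Donsker-Varadhan MINE lower bound only; the class StatisticsNetwork, the function mine_lower_bound, and the embedding dimensions are illustrative assumptions, not the paper's exact Algorithm 1.

import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta(x, y): scores whether a pair is drawn from the joint or the marginals."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mine_lower_bound(T, x, y):
    """Donsker-Varadhan estimate of I(X; Y), to be maximized during training.
    x: e.g. language-space noise embeddings, shape (batch, x_dim)
    y: e.g. audio noise embeddings, shape (batch, y_dim)
    """
    joint_term = T(x, y).mean()                        # E_P[T(x, y)]
    y_shuffled = y[torch.randperm(y.size(0))]          # break the pairing -> marginal samples
    marginal_term = torch.logsumexp(T(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint_term - marginal_term                  # lower bound on mutual information

# Usage: maximize the bound by gradient ascent (i.e. minimize its negative).
T = StatisticsNetwork(x_dim=768, y_dim=768)
x, y = torch.randn(32, 768), torch.randn(32, 768)
loss = -mine_lower_bound(T, x, y)
loss.backward()

In the paper's setting, this bound is presumably maximized so that the language-space noise embedding extracted from the N-best hypotheses retains information about the real audio noise; Algorithm 1 in the paper gives the exact procedure and the role of λ.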
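For quick reference, the CHiME-4 split sizes quoted in the Dataset Splits row can be summarized as a small mapping; the dictionary and field names below are illustrative, not part of the released dataset tooling.

# CHiME-4 utterance counts and their roles in building RobustHP, as quoted above.
CHIME4_SPLITS = {
    "tr05-real": {"utterances": 9600, "role": "RobustHP training data"},
    "test-real": {"utterances": 1320, "role": "test data"},
    "test-simu": {"utterances": 1320, "role": "test data"},
    "dev-real":  {"utterances": 1640, "role": "test data"},
    "dev-simu":  {"utterances": 1640, "role": "test data"},
}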
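The hyper-parameters quoted in the Experiment Setup row can be collected into a short configuration sketch. This is an illustration under stated assumptions: the placeholder modules llm_adapter and mine_network, their dimensions, and the steps-per-epoch arithmetic are not taken from the released code.

import torch
import torch.nn as nn

# Placeholder modules standing in for the tunable LLaMA-Adapter parameters and the
# MINE statistics network; in the real setup these come from the paper's codebase.
llm_adapter = nn.Linear(4096, 4096)
mine_network = nn.Sequential(nn.Linear(1536, 256), nn.ReLU(), nn.Linear(256, 1))

dataset = "CHiME-4"                            # or "VB-DEMAND", "NOIZEUS", "LS-FreeSound", "RATS"
lr = 1e-2 if dataset == "CHiME-4" else 5e-3    # 10^-2 for CHiME-4, 5x10^-3 for the larger sets
batch_size, accum_iters = 4, 8                 # effective batch size 4 * 8 = 32
num_epochs, weight_decay, lam = 2, 0.02, 0.5   # lam is the lambda of Algorithm 1

llm_optimizer = torch.optim.AdamW(llm_adapter.parameters(), lr=lr, weight_decay=weight_decay)
# MINE gets an extra AdamW whose learning rate is 10% of the LLM-tuning rate.
mine_optimizer = torch.optim.AdamW(mine_network.parameters(), lr=0.1 * lr, weight_decay=weight_decay)

# Warmup covers 20% of one epoch's optimizer steps; 9,600 is the CHiME-4 tr05-real size.
steps_per_epoch = 9600 // (batch_size * accum_iters)
warmup_steps = int(0.2 * steps_per_epoch)

How λ enters the total training objective is not spelled out in the quoted text, so lam above is recorded as a value only; the paper's Algorithm 1 defines its exact use.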