In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations
Authors: Ashish Mittal, Sunita Sarawagi, Preethi Jyothi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish the effectiveness of TOLSTOI using three target domains and two ASR models of varying complexity. We yield up to 35% relative reduction in word error rate with text-only adaptation, while forgetting the least compared to existing adaptation approaches. Our method is easy to implement and can be harnessed on existing RNN-T models without requiring ASR model training from scratch. (Section 4, Experiments) We present an extensive evaluation of TOLSTOI against three existing approaches with two ASR models and three target domains. |
| Researcher Affiliation | Collaboration | Ashish Mittal (IBM Research, IIT Bombay) arakeshk@in.ibm.com; Sunita Sarawagi & Preethi Jyothi (IIT Bombay) {sunita,pjyothi}@cse.iitb.ac.in |
| Pseudocode | Yes | Algorithm 1 Text-only adaptation in TOLSTOI |
| Open Source Code | No | The paper states, "We have provided sufficient implementation details of our baseline models, our imputation model and the RNN-T fine-tuning process to help reproduce our main results." However, it does not provide any specific link to source code or explicitly state that the code is publicly available. |
| Open Datasets | Yes | All our experiments are performed on publicly available datasets such as Switchboard, ATIS, Harper Valley Bank, and Librispeech. ... (1) ATIS (Hemphill et al., 1990) ... (2) Harper Valley Bank (HVB) (Wu et al., 2020) ... (3) Librispeech (Panayotov et al., 2015) |
| Dataset Splits | Yes | All our experiments are performed on publicly available datasets such as Switchboard, ATIS, Harper Valley Bank, and Librispeech. We also use published train/test splits specified for each of these datasets, thus enabling reproducibility. ... ATIS (Hemphill et al., 1990) consists of roughly 5K sentences from the airline reservation domain for training and 893 (speech, text) utterance pairs for testing. ... Harper Valley Bank (HVB) (Wu et al., 2020) consists of roughly 15K sentences from the banking domain for training and 2797 (speech, text) utterance pairs for testing. ... Librispeech (Panayotov et al., 2015) consists of roughly 29K sentences from audiobooks for training and 2619 (speech, text) utterances for testing. |
| Hardware Specification | Yes | A batch size of 64 was used to train the Pytorch models on V100 GPUs. |
| Software Dependencies | No | The paper mentions training PyTorch models and using AdamW and the One Cycle LR policy, but it does not specify version numbers for PyTorch or any other libraries/solvers. |
| Experiment Setup | Yes | The RNN-T models were trained for 20 epochs using the AdamW (Loshchilov & Hutter, 2017) optimizer with a maximum learning rate of 5e-4 and the One Cycle LR policy (Smith & Topin, 2019), consisting of a linear warmup phase from 5e-5 to 5e-4 followed by a linear annealing phase to 0. A batch size of 64 was used to train the Pytorch models on V100 GPUs. ... For fine-tuning of ML using the imputation model, we keep the same optimizer and learning rate scheduler as the starting RNN-T training except that the maximum learning rate used for fine-tuning was 5e-5. We fine-tune for a fixed number of 2000 updates... (see the sketch below the table) |
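
The optimizer and schedule quoted in the Experiment Setup row map almost directly onto PyTorch. The sketch below is illustrative only: `finetune_with_imputations`, `imputed_loader`, and `rnnt_loss` are hypothetical names standing in for the paper's fine-tuning of the RNN-T with the imputation model. Only the AdamW optimizer, the One Cycle LR policy with linear annealing, the 5e-5 maximum fine-tuning learning rate (5e-4 for initial RNN-T training), the batch size of 64, and the 2000-update budget are taken from the quotes above; everything else is an assumption.

```python
# Hedged sketch of the quoted fine-tuning setup; not the authors' released code.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR


def finetune_with_imputations(model, imputed_loader, rnnt_loss, num_updates=2000):
    """Fine-tune an existing RNN-T model on imputed speech representations.

    Hyperparameters follow the paper's quotes: AdamW, One Cycle LR,
    max learning rate 5e-5 for fine-tuning, and a fixed 2000 updates.
    The warmup/annealing div factors are not specified in the quotes,
    so PyTorch defaults are used here.
    """
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = OneCycleLR(
        optimizer,
        max_lr=5e-5,                 # 5e-4 was used for the initial 20-epoch training
        total_steps=num_updates,
        anneal_strategy="linear",    # linear warmup then linear annealing, per the quote
    )

    model.train()
    step = 0
    while step < num_updates:
        for batch in imputed_loader:      # batches of 64, per the quoted setup
            loss = rnnt_loss(model, batch)  # placeholder for the RNN-T loss on imputed features
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= num_updates:
                break
    return model
```

The same optimizer and scheduler are described for the initial RNN-T training, differing only in the maximum learning rate (5e-4) and in running for 20 epochs rather than a fixed number of updates.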