Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. |
| Researcher Affiliation | Industry | Google Brain, Google LLC |
| Pseudocode | Yes | Algorithm 1: Pseudocode of training Dual-mode ASR networks. (Hedged sketches of a dual-mode layer and of the training step follow the table.) |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct our experiments on two datasets: a widely used public dataset LibriSpeech (Panayotov et al., 2015) (1,000 hours of read English speech) and a large-scale dataset MultiDomain (413,000 hours of speech; 287 million utterances drawn from a mixture of domains including Voice Search, YouTube, and Meetings). |
| Dataset Splits | No | For LibriSpeech, we report our evaluation results on the test-clean and test-other (noisy) sets and compare with other published baselines. For MultiDomain, we report our evaluation results on the Voice Search test set and compare with our reproduced baselines. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'TensorFlow and PyTorch' as platforms where a function can be implemented but does not provide specific version numbers for these or any other software dependencies, libraries, or solvers used in the experiments. |
| Experiment Setup | Yes | We train our models exactly following our baselines ContextNet (Han et al., 2020) and Conformer (Gulati et al., 2020), using the Adam optimizer (Kingma & Ba, 2014), SpecAugment (Park et al., 2019), and a transformer learning rate schedule (Vaswani et al., 2017) with warm-up (Goyal et al., 2017). Our main results are summarized in Table 2 and Table 3. We also add a streaming ContextNet look-ahead baseline (6 frames, 10 ms per frame, 60 ms of look-ahead latency in total) in Table 3 by padding additional frames at the end of the input utterances. We do a small-scale hyper-parameter sweep over the number of frames to shift, from -2 to 2, for ContextNet and Conformer in our experiments. (A sketch of the warm-up learning rate schedule follows the table.) |
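For concreteness, here is a minimal sketch of the weight-sharing idea behind a dual-mode layer: a single convolution whose padding switches between symmetric (full-context) and left-only (causal, streaming). The class name `DualModeConv1d` and the `mode` argument are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualModeConv1d(nn.Module):
    """Weight-shared 1-D convolution that runs in either full-context
    (symmetric padding) or streaming (causal, left-only padding) mode.
    A hedged illustration of the weight-sharing idea, not the paper's code."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(channels, channels, kernel_size)  # no built-in padding

    def forward(self, x: torch.Tensor, mode: str = "full") -> torch.Tensor:
        # x: (batch, channels, time)
        if mode == "full":
            # Symmetric padding: each output frame sees past and future context.
            left = (self.kernel_size - 1) // 2
            right = self.kernel_size - 1 - left
        else:
            # Streaming: pad only on the left so no future frames are used.
            left, right = self.kernel_size - 1, 0
        return self.conv(F.pad(x, (left, right)))
```

In full-context mode each output frame can attend to roughly `(kernel_size - 1) // 2` future frames; in streaming mode it sees none, yet both regimes reuse exactly the same weights.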
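And a hedged sketch of a single training step in the spirit of the paper's Algorithm 1: both modes are trained jointly, with in-place knowledge distillation from the full-context mode to the streaming mode of the same network. Here `model`, `asr_loss`, and `distill_weight` are stand-ins; the paper's models are transducer-based, so `asr_loss` would be an RNN-T-style loss in practice.

```python
import torch
import torch.nn.functional as F

def dual_mode_train_step(model, batch, optimizer, asr_loss, distill_weight=0.1):
    """One optimization step over both modes (a sketch of Algorithm 1's idea).

    `model(feats, mode=...)` is assumed to return per-frame logits from a
    weight-shared dual-mode network; `asr_loss` stands in for the actual
    transducer loss; `distill_weight` is an illustrative hyper-parameter.
    """
    feats, targets = batch

    # Full-context pass: unrestricted attention / symmetric convolution.
    logits_full = model(feats, mode="full")
    loss_full = asr_loss(logits_full, targets)

    # Streaming pass through the same weights: causal convolution,
    # left-context-only attention.
    logits_stream = model(feats, mode="stream")
    loss_stream = asr_loss(logits_stream, targets)

    # In-place distillation: the streaming mode mimics the full-context
    # predictions of the very same network. The teacher is detached so
    # gradients only flow into the student (streaming) pass.
    teacher = F.softmax(logits_full.detach(), dim=-1)
    student = F.log_softmax(logits_stream, dim=-1)
    loss_distill = F.kl_div(student, teacher, reduction="batchmean")

    loss = loss_full + loss_stream + distill_weight * loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```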
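Finally, the "transformer learning rate schedule with warm-up" cited in the setup row is the inverse-square-root schedule of Vaswani et al. (2017): linear warm-up followed by decay proportional to the inverse square root of the step. A minimal sketch follows; `d_model` and `warmup_steps` are illustrative defaults, not the paper's settings.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 10_000) -> float:
    """Transformer schedule (Vaswani et al., 2017):
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)."""
    step = max(step, 1)  # avoid 0**-0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, which is where the learning rate peaks before decaying.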