Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain.
Researcher Affiliation | Industry | 1Google Brain 2Google LLC
Pseudocode | Yes | Algorithm 1 Pseudocode of training Dual-mode ASR networks. (A hedged training-step sketch follows this table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct our experiments on two datasets: a public widely used dataset LibriSpeech (Panayotov et al., 2015) (1,000 hours of English reading speech) and a large-scale dataset MultiDomain (413,000 hours of speech, 287 million utterances of a mixture across multiple domains including Voice Search, YouTube, and Meetings).
Dataset Splits | No | For LibriSpeech, we report our evaluation results on the TestClean and TestOther (noisy) sets and compare with other published baselines. For MultiDomain, we report our evaluation results on the Voice Search test set and compare with our reproduced baselines.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) used to run the experiments.
Software Dependencies | No | The paper mentions TensorFlow and PyTorch as platforms in which a function can be implemented, but it does not provide version numbers for these or for any other software dependencies, libraries, or solvers used in the experiments.
Experiment Setup | Yes | We train our models exactly following our baselines ContextNet (Han et al., 2020) and Conformer (Gulati et al., 2020), using the Adam optimizer (Kingma & Ba, 2014), SpecAugment (Park et al., 2019), and a transformer learning rate schedule (Vaswani et al., 2017) with warm-up (Goyal et al., 2017). Our main results are summarized in Table 2 and Table 3. We also add a streaming ContextNet look-ahead baseline (6 frames, 10 ms per frame, 60 ms total look-ahead latency) in Table 3 by padding additional frames at the end of the input utterances. We do a small-scale hyper-parameter sweep over the number of frames to shift, from -2 to 2, for ContextNet and Conformer in our experiments. (A short sketch of the warm-up learning-rate schedule also follows the table.)
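
The Pseudocode row refers to the paper's Algorithm 1, which trains a single network jointly in streaming and full-context modes with inplace knowledge distillation from the full-context mode to the streaming mode. Below is a minimal PyTorch sketch of that training step under stated assumptions: the tiny encoder, the per-frame cross-entropy losses (standing in for the paper's RNN-T losses), the `distill_weight` value, and all layer sizes are illustrative choices of ours, not the authors' implementation; only the overall structure (shared weights, causal vs. symmetric padding per mode, detached full-context teacher) follows the paper's description.

```python
# Hedged sketch of a dual-mode training step (Algorithm 1 in the paper).
# Model architecture, losses, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualModeConv(nn.Module):
    """1-D convolution with shared weights, run with causal padding in
    streaming mode and symmetric padding in full-context mode."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=0)

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        # x: (batch, channels, time)
        if mode == "streaming":
            pad = (self.kernel_size - 1, 0)           # pad only the past
        else:
            half = (self.kernel_size - 1) // 2        # pad both sides
            pad = (half, self.kernel_size - 1 - half)
        return self.conv(F.pad(x, pad))


class TinyDualModeASR(nn.Module):
    """Toy encoder + per-frame classifier standing in for ContextNet/Conformer."""

    def __init__(self, feat_dim: int = 80, channels: int = 128, vocab: int = 32):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, 1)
        self.block = DualModeConv(channels)
        self.head = nn.Conv1d(channels, vocab, 1)

    def forward(self, feats: torch.Tensor, mode: str) -> torch.Tensor:
        h = torch.relu(self.proj(feats))
        h = torch.relu(self.block(h, mode))
        return self.head(h)                           # (batch, vocab, time) logits


def dual_mode_step(model, feats, frame_targets, optimizer, distill_weight=1.0):
    """One step: full-context loss + streaming loss + inplace distillation."""
    optimizer.zero_grad()

    full_logits = model(feats, mode="full_context")
    stream_logits = model(feats, mode="streaming")

    # Per-frame cross-entropy stands in for the paper's RNN-T losses.
    ce = nn.CrossEntropyLoss()
    loss_full = ce(full_logits, frame_targets)
    loss_stream = ce(stream_logits, frame_targets)

    # Inplace knowledge distillation: the full-context mode (teacher, detached)
    # guides the streaming mode (student) of the same network.
    teacher = F.softmax(full_logits.detach(), dim=1)
    student = F.log_softmax(stream_logits, dim=1)
    loss_kd = F.kl_div(student, teacher, reduction="batchmean")

    loss = loss_full + loss_stream + distill_weight * loss_kd
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = TinyDualModeASR()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    feats = torch.randn(2, 80, 50)                    # (batch, feat_dim, time)
    targets = torch.randint(0, 32, (2, 50))           # per-frame labels
    print(dual_mode_step(model, feats, targets, opt))
```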
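
The experiment-setup row cites the transformer learning rate schedule with warm-up (Vaswani et al., 2017). For reference, that schedule is lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5); the d_model and warmup_steps values below are placeholders, since the paper does not report the exact schedule constants in the quoted text.

```python
# Sketch of the transformer learning-rate schedule with warm-up.
# The constants d_model=512 and warmup_steps=10000 are assumptions.
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 10000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for s in (1, 1_000, 10_000, 100_000):
        print(s, transformer_lr(s))
```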