Zipformer: A faster and better encoder for automatic speech recognition
Authors: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. |
| Researcher Affiliation | Industry | Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey Xiaomi Corp., Beijing, China dpovey@xiaomi.com |
| Pseudocode | Yes | Algorithm 1 in Appendix Section A.1.1 presents the pseudo-code of ScaledAdam. A hedged sketch of its core idea appears after this table. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/k2-fsa/icefall. |
| Open Datasets | Yes | We perform experiments to compare our Zipformer with state-of-the-art models on three open-source datasets: 1) LibriSpeech (Panayotov et al., 2015), which consists of about 1000 hours of English audiobook reading; 2) Aishell-1 (Bu et al., 2017), which contains 170 hours of Mandarin speech; 3) WenetSpeech (Zhang et al., 2022a), which consists of 10000+ hours of multi-domain Mandarin speech. |
| Dataset Splits | No | The paper uses 'test-clean' and 'test-other' for LibriSpeech and 'Dev' and 'Test' for Aishell-1 and WenetSpeech, implying standard splits. However, it does not explicitly state the numerical percentages or sample counts for training, validation, and test splits for any dataset, nor does it cite a source that defines these exact splits. |
| Hardware Specification | Yes | By default, all of our models are trained on 32GB NVIDIA Tesla V100 GPUs. For the LibriSpeech dataset, Zipformer-M and Zipformer-L are trained for 50 epochs on 4 GPUs, and Zipformer-S is trained for 50 epochs on 2 GPUs. For the Aishell-1 dataset, our models are trained for 56 epochs on 2 GPUs. For the WenetSpeech dataset, our models are trained for 14 epochs on 4 GPUs. Trained with 8 80GB NVIDIA Tesla A100 GPUs for 170 epochs. |
| Software Dependencies | No | The paper mentions using the 'Lhotse (Zelasko et al., 2021) toolkit' for data preparation and 'DeepSpeed (Rasley et al., 2020)' for FLOPs measurement. However, specific version numbers for these, or other core software like PyTorch/TensorFlow, are not provided. |
| Experiment Setup | Yes | The model inputs are 80-dimension Mel filter-bank features extracted on 25ms frames with a frame shift of 10ms. Speed perturbation (Ko et al., 2015) with factors of 0.9, 1.0, and 1.1 is used to augment the training data. SpecAugment (Park et al., 2019) is also applied during training. We use mixed precision training for our Zipformer models. We also employ the activation constraints including Balancer and Whitener... Pruned transducer (Kuang et al., 2022)... During decoding, beam search of size 4 with the constraint of emitting at most one symbol per frame is employed... The proposed Eden learning rate schedule is formulated as: ... we use αbase = 0.045, αstart = 0.5, and twarmup = 500. Table 1: Configuration of Zipformer at three different scales. A hedged sketch of the Eden schedule also follows this table. |
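
On the Pseudocode row above: the paper's Algorithm 1 gives ScaledAdam in pseudo-code, and the full implementation lives in the icefall repository. As a rough illustration of the core idea only, the minimal PyTorch sketch below scales an Adam-style update by each parameter tensor's RMS, so the learning rate controls the *relative* change per step. The class name `SimpleScaledAdam`, the hyperparameter defaults, and the omission of the learned parameter scale and update clipping are simplifications of ours, not the authors' implementation.

```python
import torch


class SimpleScaledAdam(torch.optim.Optimizer):
    """Minimal sketch of the core ScaledAdam idea: an Adam-style update
    whose step size is scaled by each parameter tensor's RMS, so the
    learning rate expresses a fractional (relative) change per step.
    The real ScaledAdam in icefall also learns an explicit parameter
    scale and clips updates; those parts are omitted here."""

    def __init__(self, params, lr=0.045, betas=(0.9, 0.98), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group["lr"]
            beta1, beta2 = group["betas"]
            eps = group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                m, v = state["exp_avg"], state["exp_avg_sq"]
                # Standard Adam moment estimates with bias correction.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                bc1 = 1 - beta1 ** state["step"]
                bc2 = 1 - beta2 ** state["step"]
                direction = (m / bc1) / ((v / bc2).sqrt() + eps)
                # Key idea: scale the step by the parameter tensor's RMS,
                # so lr controls the relative change of the tensor.
                param_rms = p.pow(2).mean().sqrt().clamp(min=1e-5)
                p.add_(direction * param_rms, alpha=-lr)
```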
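On the Experiment Setup row: the Eden formula is elided in the quote above, so the sketch below is a hedged reconstruction based on the open-source icefall implementation the paper points to. It combines a step-dependent and an epoch-dependent inverse-fourth-root decay with a linear warmup from αstart to 1 over twarmup steps; the constants `lr_batches` and `lr_epochs` are icefall defaults assumed here, not values stated in the quote.

```python
def eden_lr(step: int, epoch: float,
            base_lr: float = 0.045,       # α_base from the quote above
            warmup_steps: int = 500,      # t_warmup from the quote above
            warmup_start: float = 0.5,    # α_start from the quote above
            lr_batches: float = 5000.0,   # assumed icefall default
            lr_epochs: float = 3.5) -> float:
    """Hedged sketch of the Eden schedule: two inverse-fourth-root decay
    factors (one in steps, one in epochs) times a linear warmup."""
    step_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    # linear(α_start, t_warmup, t): rises from α_start to 1, then stays at 1.
    warmup = min(1.0, warmup_start + (1.0 - warmup_start) * step / warmup_steps)
    return base_lr * step_factor * epoch_factor * warmup
```

With the quoted values (αbase = 0.045, αstart = 0.5, twarmup = 500), `eden_lr(0, 0)` returns 0.0225 and the warmup factor reaches 1 at step 500, after which both decay factors gradually reduce the rate as training progresses.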