Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Authors: Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online [25].
Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 ICSI; 3 LBNL. {sehoonkim, amirgh, nicholas_lee, mangalam, malik, mahoneymw, keutzer}@berkeley.edu; Albertshaw@google.com
Pseudocode | No | The paper describes the architecture and modifications using text and diagrams, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is open-sourced and available online [25]. Our code along with the checkpoints for all of the trained models is open-sourced and available online [25].
Open Datasets | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41]. Reference [41]: Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206-5210, 2015.
Dataset Splits | Yes | Table 3: WER (%) comparison on LibriSpeech dev and test datasets for Squeezeformer and other state-of-the-art CTC models for ASR including Conformer-CTC, QuartzNet [27], CitriNet [36], Transformer-CTC [31], and Efficient Conformer-CTC [4]. For comparison, we include the number of parameters, FLOPs, and throughput (Thp) on a single NVIDIA Tesla A100 GPU for a 30s input in the last three columns. The performance numbers for Conformer-CTC are based on our own reproduction to the best performance possible, and the others are the reported numbers in their papers [4, 27, 36]. With and without the grouped attention. Table columns: Model, dev-clean, dev-other, test-clean, test-other, Params (M), GFLOPs, Thp (ex/s). (A loading sketch for these standard splits appears after this table.)
Hardware Specification | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41] for 500 epochs on Google's Cloud TPUs v3 with batch size 1024 for the small and medium variants and 2048 for the large variants. Throughput (Thp, ex/s) is reported on a single NVIDIA Tesla A100 GPU.
Software Dependencies | No | The paper mentions using the AdamW optimizer, but it does not provide version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, or CUDA versions).
Experiment Setup | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41] for 500 epochs on Google's Cloud TPUs v3 with batch size 1024 for the small and medium variants and 2048 for the large variants. We use the AdamW [33] optimizer with weight decay 5e-4 for all models. (An optimizer configuration sketch appears after this table.)
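
The splits quoted in the Dataset Splits row (dev-clean, dev-other, test-clean, test-other) are the standard LibriSpeech partitions, and LibriSpeech-960hr is the union of the three public training subsets. Below is a minimal loading sketch, assuming torchaudio is available; it only illustrates the public splits and is not the paper's actual data pipeline.

```python
# Hedged sketch: the standard LibriSpeech partitions referenced in Table 3.
# Assumes torchaudio is installed; this illustrates the public splits only and
# is not the paper's data pipeline.
import torchaudio

ROOT = "./data"  # hypothetical local path

# LibriSpeech-960hr is the union of these three training subsets.
TRAIN_SUBSETS = ["train-clean-100", "train-clean-360", "train-other-500"]
# Evaluation splits reported in Table 3.
EVAL_SUBSETS = ["dev-clean", "dev-other", "test-clean", "test-other"]

# Download only the (small) evaluation splits for this illustration.
eval_sets = {
    name: torchaudio.datasets.LIBRISPEECH(ROOT, url=name, download=True)
    for name in EVAL_SUBSETS
}

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = eval_sets["dev-clean"][0]
print(sample_rate, transcript[:60])
```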
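
The Experiment Setup row quotes AdamW with weight decay 5e-4, 500 training epochs, and batch sizes of 1024 or 2048. The following is a minimal PyTorch-style sketch of that optimizer configuration; the model, learning rate, and schedule are placeholders, since the quoted text does not specify them, and the paper's released code may use a different framework.

```python
# Hedged sketch of the quoted optimizer settings (AdamW, weight decay 5e-4).
# The model, learning rate, and schedule are assumptions; only the weight decay,
# epoch count, and batch sizes come from the quoted setup.
import torch

model = torch.nn.Linear(80, 256)   # placeholder; stands in for the ASR encoder
EPOCHS = 500                        # quoted: 500 epochs
BATCH_SIZE = 1024                   # quoted: 1024 (S/M variants) or 2048 (L variants)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                        # assumed value; not stated in the quoted text
    weight_decay=5e-4,              # quoted weight decay
)
```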