Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Authors: Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online [25]. |
| Researcher Affiliation | Collaboration | 1: University of California, Berkeley; 2: ICSI; 3: LBNL. {sehoonkim, amirgh, nicholas_lee, mangalam, malik, mahoneymw, keutzer}@berkeley.edu, Albertshaw@google.com |
| Pseudocode | No | The paper describes the architecture and modifications using text and diagrams, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is open-sourced and available online [25]. Our code along with the checkpoints for all of the trained models is open-sourced and available online [25]. |
| Open Datasets | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41]. Reference [41]: Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206-5210, 2015. |
| Dataset Splits | Yes | Table 3: WER (%) comparison on LibriSpeech dev and test datasets for Squeezeformer and other state-of-the-art CTC models for ASR including Conformer-CTC, QuartzNet [27], Citrinet [36], Transformer-CTC [31], and Efficient Conformer-CTC [4]. For comparison, we include the number of parameters, FLOPs, and throughput (Thp) on a single NVIDIA Tesla A100 GPU for a 30s input in the last three columns. The performance numbers for Conformer-CTC are based on our own reproduction to the best performance as possible and the others are the reported numbers in their papers [4, 27, 36]. With and without the grouped attention. Table columns: Model, dev-clean, dev-other, test-clean, test-other, Params (M), GFLOPs, Thp (ex/s). (A data-loading sketch for these splits follows the table.) |
| Hardware Specification | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41] for 500 epochs on Google's Cloud TPUs v3 with batch size 1024 for the small and medium variants and 2048 for the large variants. Throughput (Thp, ex/s) is measured on a single NVIDIA Tesla A100 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We train both Conformer-CTC and Squeezeformer on the LibriSpeech-960hr [41] for 500 epochs on Google's Cloud TPUs v3 with batch size 1024 for the small and medium variants and 2048 for the large variants. We use the AdamW [33] optimizer with weight decay 5e-4 for all models. (A hedged optimizer sketch follows the table.) |
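The Open Datasets and Dataset Splits rows name the LibriSpeech-960hr training set and the dev-clean/dev-other/test-clean/test-other evaluation splits. The authors' open-sourced code [25] is the authoritative data pipeline; the snippet below is only a minimal, hypothetical sketch that assembles the same splits with `torchaudio`, which the paper does not claim to use.

```python
# Minimal sketch (not the authors' pipeline): assembling LibriSpeech-960hr and the
# dev/test splits referenced in Table 3 using torchaudio's built-in dataset loader.
from torch.utils.data import ConcatDataset
from torchaudio.datasets import LIBRISPEECH

ROOT = "./data"  # hypothetical local path

# LibriSpeech-960hr is the concatenation of the three official training subsets.
train_960 = ConcatDataset([
    LIBRISPEECH(ROOT, url="train-clean-100", download=True),
    LIBRISPEECH(ROOT, url="train-clean-360", download=True),
    LIBRISPEECH(ROOT, url="train-other-500", download=True),
])

# Evaluation splits reported in the paper's WER tables.
eval_splits = {
    name: LIBRISPEECH(ROOT, url=name, download=True)
    for name in ("dev-clean", "dev-other", "test-clean", "test-other")
}
```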
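The Experiment Setup row fixes only a handful of hyperparameters: the AdamW optimizer with weight decay 5e-4, 500 epochs, and batch size 1024 (small/medium) or 2048 (large) on Cloud TPUs v3. Below is a minimal PyTorch sketch of that optimizer configuration; the learning-rate value is a placeholder rather than a number from the paper, and the authors' released code [25] remains the reference implementation.

```python
# Sketch of the reported optimizer setup: AdamW with weight decay 5e-4.
# The learning rate is a placeholder; the paper's schedule is not quoted in this report.
import torch

model = torch.nn.Linear(80, 256)  # stand-in module, not the Squeezeformer encoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # placeholder value
    weight_decay=5e-4,  # as stated in the Experiment Setup row
)

EPOCHS = 500             # reported training length
BATCH_SIZE_SMALL = 1024  # small and medium variants
BATCH_SIZE_LARGE = 2048  # large variants
```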