Towards Building ASR Systems for the Next Billion Users
Authors: Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
AAAI 2022, pp. 10813–10821 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Fourth, we fine-tune this model for downstream ASR for 9 languages and obtain state-of-the-art results on 3 public datasets, including on very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent. |
| Researcher Affiliation | Collaboration | IIT Madras, AI4Bharat, Microsoft, RBCDSAI |
| Pseudocode | No | The paper describes methods and procedures in text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have publicly released all the artifacts of our work to spur further work in the area of Indic ASR. This includes: (a) sources of pretraining data along with scripts for their collection and pre-processing, (b) pretraining, fine-tuning and decoding scripts, and (c) our best ASR models. https://github.com/AI4Bharat/IndicWav2Vec |
| Open Datasets | Yes | We experiment with 3 ASR datasets covering 9 Indian languages. These include the MSR (Microsoft Research) dataset which was released as a part of the Low Resource Speech Recognition Challenge for Indian Languages (Srivastava et al. 2018), the MUCS2021 dataset which was released as a part of the Multilingual and code-switching ASR challenges for low resource Indian languages (Diwan et al. 2021), and a subset of the Open SLR dataset (Kjartansson et al. 2018) obtained from the authors of Shetty and Umesh (2021). |
| Dataset Splits | No | The paper refers to |
| Hardware Specification | Yes | For the BASE model...train the model on 24 A100 GPUs... For the LARGE model...train the model on 24 A100 GPUs... |
| Software Dependencies | No | The paper mentions software like fairseq, KenLM, Flashlight, py-webrtcvad, youtube-dl, and FFmpeg, but does not specify their version numbers. |
| Experiment Setup | Yes | Pretraining Setup We pretrain two variants of the model, viz., BASE and LARGE... For the BASE model...train the model on 24 A100 GPUs with gradient accumulation of 2 steps, making the effective batch size 3.2 hours. We used the Adam optimizer with the learning rate set to 0.0005 and decayed the learning rate polynomially after a warmup of 32k steps. We trained the model for 160k steps. ... Fine-tuning Setup During fine-tuning...we used the Adam optimizer with a learning rate of 1e-4 and a tri-stage learning rate schedule... We set the maximum number of frames per GPU to 1M and fine-tune the models on 8 A100 GPUs without any gradient accumulation, making our effective batch size 8M samples or 500 secs. We trained the BASE model for 80k steps and the LARGE model for 120k steps. |
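
As a cross-check of the quoted experiment setup, the sketch below collects the reported hyperparameters and verifies the effective-batch-size arithmetic. It is plain Python written for this report, not taken from the released IndicWav2Vec code; the 16 kHz sampling rate is an assumption (standard for wav2vec 2.0 audio), not stated in the excerpt.

```python
# Hedged sketch: quoted hyperparameters from the paper's experiment setup,
# plus a sanity check of the effective-batch-size arithmetic.
# The 16 kHz sampling rate is an assumption; everything else is as quoted.

SAMPLE_RATE = 16_000  # Hz (assumed, standard for wav2vec 2.0)

pretraining_base = {
    "gpus": 24,                      # A100s
    "grad_accumulation": 2,
    "effective_batch_hours": 3.2,    # as reported
    "optimizer": "adam",
    "lr": 5e-4,
    "lr_schedule": "polynomial_decay",
    "warmup_steps": 32_000,
    "total_steps": 160_000,
}

finetuning = {
    "gpus": 8,                       # A100s, no gradient accumulation
    "max_frames_per_gpu": 1_000_000, # raw audio samples per GPU per step
    "optimizer": "adam",
    "lr": 1e-4,
    "lr_schedule": "tri_stage",
    "steps_base": 80_000,
    "steps_large": 120_000,
}

# Fine-tuning effective batch: 1M samples/GPU x 8 GPUs = 8M samples,
# which at 16 kHz corresponds to 8e6 / 16e3 = 500 seconds of audio per update.
effective_samples = finetuning["max_frames_per_gpu"] * finetuning["gpus"]
effective_seconds = effective_samples / SAMPLE_RATE
assert effective_samples == 8_000_000 and effective_seconds == 500.0

# Pretraining effective batch expressed in seconds of audio per update.
pretrain_seconds = pretraining_base["effective_batch_hours"] * 3600  # 11,520 s
print(f"fine-tuning batch: {effective_samples:,} samples ≈ {effective_seconds:.0f} s")
print(f"pretraining batch: ≈ {pretrain_seconds:,.0f} s of audio per update")
```

Running the script confirms the "8M samples or 500 secs" figure quoted above under the assumed 16 kHz rate, and restates the 3.2-hour pretraining batch as seconds of audio per optimizer update.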