Towards Building ASR Systems for the Next Billion Users

Authors: Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra (pp. 10813-10821)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Fourth, we fine-tune this model for downstream ASR for 9 languages and obtain state-of-the-art results on 3 public datasets, including on very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.
Researcher Affiliation | Collaboration | IIT Madras, AI4Bharat, Microsoft, RBCDSAI
Pseudocode | No | The paper describes methods and procedures in text but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We have publicly released all the artifacts of our work to spur further work in the area of Indic ASR. This includes: (a) sources of pretraining data along with scripts for their collection and pre-processing, (b) pretraining, fine-tuning and decoding scripts, and (c) our best ASR models. (https://github.com/AI4Bharat/IndicWav2Vec)
Open Datasets | Yes | We experiment with 3 ASR datasets covering 9 Indian languages. These include the MSR (Microsoft Research) dataset which was released as a part of the Low Resource Speech Recognition Challenge for Indian Languages (Srivastava et al. 2018), the MUCS 2021 dataset which was released as a part of the Multilingual and code-switching ASR challenges for low resource Indian languages (Diwan et al. 2021), and a subset of the OpenSLR dataset (Kjartansson et al. 2018) obtained from the authors of Shetty and Umesh (2021).
Dataset Splits | No | The paper refers to
Hardware Specification | Yes | For the BASE model...train the model on 24 A100 GPUs... For the LARGE model...train the model on 24 A100 GPUs...
Software Dependencies | No | The paper mentions software such as fairseq, KenLM, Flashlight, py-webrtcvad, youtube-dl, and FFmpeg, but does not specify their version numbers. (A voice-activity-detection sketch using py-webrtcvad appears after this table.)
Experiment Setup | Yes | Pretraining Setup: We pretrain two variants of the model, viz., BASE and LARGE... For the BASE model...train the model on 24 A100 GPUs with gradient accumulation of 2 steps, making the effective batch size 3.2 hours. We used Adam optimizer with learning rate set to 0.0005 and decayed the learning rate polynomially after a warmup for 32k steps. We trained the model for 160k steps. ... Fine-tuning Setup: During fine-tuning...we used Adam optimiser with a learning rate of 1e-4 and tri-stage learning rate schedule...We set the maximum number of frames per GPU to 1M and fine-tune the models on 8 A100 GPUs without any gradient accumulation, making our effective batch size as 8M samples or 500 secs. We trained the BASE model for 80k steps and the LARGE model for 120k steps. (A sketch of both learning-rate schedules follows this table.)
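The preprocessing toolchain listed under Software Dependencies includes py-webrtcvad for voice activity detection. The minimal sketch below shows how raw 16 kHz, 16-bit mono PCM audio could be scanned for speech frames with that library; the 30 ms frame length and aggressiveness level 3 are illustrative assumptions, since the paper does not report its exact VAD configuration.

```python
import webrtcvad

def speech_frames(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=3):
    """Yield (offset_seconds, is_speech) for fixed-length frames of raw 16-bit mono PCM.

    webrtcvad accepts 10, 20, or 30 ms frames at 8/16/32/48 kHz; the values used
    here are assumptions for illustration, not settings from the paper.
    """
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[start:start + frame_bytes]
        # start is a byte offset, so divide by bytes-per-second to get seconds
        yield start / (2 * sample_rate), vad.is_speech(frame, sample_rate)
```

Such per-frame decisions are typically smoothed (e.g. by requiring a run of consecutive speech frames) before cutting audio into segments; that smoothing step is not shown here.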
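The Experiment Setup row quotes a polynomial decay schedule for pretraining (peak 0.0005, 32k warmup steps, 160k total steps) and a tri-stage schedule for fine-tuning (peak 1e-4). The plain-Python sketch below shows one way those schedules could be computed per update step, assuming fairseq-style step-based scheduling; the tri-stage phase ratios and final scale are common wav2vec 2.0 fine-tuning defaults and are assumptions, not values reported in the paper. The quoted batch arithmetic is also consistent: 1M audio samples per GPU across 8 GPUs gives 8M samples, and 8,000,000 / 16,000 Hz = 500 seconds per update.

```python
import math

def polynomial_decay_lr(step, peak_lr=5e-4, warmup_steps=32_000,
                        total_steps=160_000, end_lr=0.0, power=1.0):
    """Pretraining: linear warmup to peak_lr, then polynomial decay to end_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr

def tri_stage_lr(step, peak_lr=1e-4, total_steps=80_000,
                 phase_ratio=(0.1, 0.4, 0.5), final_lr_scale=0.05):
    """Fine-tuning: warmup, hold at peak_lr, then exponential decay (tri-stage).

    phase_ratio and final_lr_scale are assumed defaults, not paper-reported values.
    """
    warmup = int(phase_ratio[0] * total_steps)
    hold = int(phase_ratio[1] * total_steps)
    decay = max(1, total_steps - warmup - hold)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    if step < warmup + hold:
        return peak_lr
    progress = min(1.0, (step - warmup - hold) / decay)
    return peak_lr * math.exp(math.log(final_lr_scale) * progress)
```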