IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages
Authors: Tahir Javed, Kaushal Bhogale, Abhigyan Raman, Pratyush Kumar, Anoop Kunchukuttan, Mitesh M. Khapra
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks: Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query By Example, and Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside a commonly used baseline FBANK. We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks, including a large gap of 76% for the Language Identification task. |
| Researcher Affiliation | Collaboration | 1Indian Institute of Technology Madras, 2AI4Bharat, 3Microsoft; {tahir, cs22d006}@cse.iitm.ac.in, ramanabhigyan@gmail.com, pratyush@cse.iitm.ac.in, ankunchu@microsoft.com, miteshk@cse.iitm.ac.in |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the code, datasets and models developed as a part of this work have been made publicly available (https://github.com/AI4Bharat/indicSUPERB) and we hope that they will help in furthering research on speech technology for Indian languages. |
| Open Datasets | Yes | Our first main contribution is to build a large dataset for automatic speech recognition containing read speech across 12 Indian languages from 1,218 speakers spanning 203 districts (see Figure 1). This is a one of its kind open-source effort resulting in a very large dataset containing 1,684 hours of labelled speech recognition data. |
| Dataset Splits | Yes | Train, Validation and Test Splits Our goal was to provide training as well as evaluation data for various SLU tasks. Further, we wanted that the benchmark should support different conditions, e.g., (i) evaluation for speakers existing in the training data (ii) evaluation for speakers not existing in the training data (iii) evaluation on noisy data. To enable this, we divided the data into training set and multiple validation and test sets as explained below. [...] Validation: We create this using the same procedure as used for creating the test-known set. |
| Hardware Specification | No | We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. |
| Software Dependencies | No | The paper mentions using the 's3prl framework' but does not specify its version number, nor does it list any other software dependencies with versions. |
| Experiment Setup | Yes | ASR: Recent work on Indian languages has shown that wav2vec2 based models perform well on a wide variety of ASR benchmarks (Javed et al. 2022). Following this, we evaluate two wav2vec2 based models, viz., IndicWav2Vec and XLS-R (Babu et al. 2021). [...] We restricted the output vocabulary of the model to characters only. In addition to the acoustic model, we also trained a 6-gram KenLM language model for each language using all the sentences from IndicCorp (ranging from 8M to 87M for different languages). During decoding, we combine the scores of the acoustic model and the language model. We use a beam size of 128 and set the LM weight and word score parameter to 2 and -1 respectively. For all our experiments we use WER (word error rate) to measure the performance. |
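The experiment setup quoted above describes two concrete computations: fusing acoustic and language-model scores during beam-search decoding (LM weight 2, word score -1), and scoring outputs with WER. The sketch below illustrates both in plain Python; it is not the authors' evaluation code, and `fused_score` is a hypothetical helper showing only the linear score-combination formula implied by the quoted hyperparameters.

```python
def fused_score(ac_logp: float, lm_logp: float, n_words: int,
                lm_weight: float = 2.0, word_score: float = -1.0) -> float:
    """Beam-search hypothesis score: acoustic log-prob plus weighted
    LM log-prob plus a per-word insertion bonus/penalty, as described
    in the paper's decoding setup (LM weight 2, word score -1)."""
    return ac_logp + lm_weight * lm_logp + word_score * n_words


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words (standard definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("a b c", "a x c")` is one substitution over three reference words, i.e. 1/3, and a perfect hypothesis gives 0.0.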