Unsupervised Speech Recognition

Authors: Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate the viability of the framework for a variety of settings and languages. wav2vec-U improves the phone error rate (PER) on the small-scale TIMIT benchmark from 26.1 to 11.3 compared to the next best known unsupervised approach. To get a better sense of the performance compared to the best supervised methods, we measure performance on the larger Librispeech benchmark, where our method achieves a word error rate (WER) of 5.9 on test-other."
Researcher Affiliation | Industry | Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli; listed affiliations: Facebook AI, Google AI
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "The code is available at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised"
Open Datasets | Yes | "Librispeech is a standard benchmark in the speech recognition community which provides about 960 hours of transcribed read audiobooks. We use the language modeling data of Librispeech as unlabeled text data for unsupervised training. We also consider self-training over three iterations by first training an HMM on the labels generated by the GAN, then fine-tuning the original wav2vec 2.0 model on the labels of the HMM for Librispeech, followed by fine-tuning on Libri-Light; Appendix F investigates alternatives."
Dataset Splits | Yes | "Librispeech provides clean dev/test sets which are less challenging than the other sets. We measure performance on the standard Kaldi dev and test sets (core-dev/core-test) as well as a slightly larger version of the test set (all-test) to be able to compare to Liu et al. [2018] and Chen et al. [2019]."
Hardware Specification | No | The paper mentions using GPUs for fast clustering with the FAISS library but does not specify particular GPU models, CPU models, or other hardware details used for running the experiments or training the models. (A minimal FAISS clustering sketch follows the table.)
Software Dependencies | No | The paper mentions software such as fairseq, PyTorch, FAISS, and Kaldi, but does not specify version numbers for these or any other key software dependencies. (A version-logging snippet follows the table.)
Experiment Setup | No | The paper describes the model architecture and objective function, including the penalties used (gradient penalty, segment smoothness penalty, phoneme diversity loss), but does not provide specific numerical values for hyperparameters such as learning rate, batch size, number of epochs, or the weights (λ, γ, η) for the loss components. (A hedged sketch of these penalties follows the table.)
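
For context on the Hardware Specification row: the paper's segmentation step clusters wav2vec 2.0 representations with k-means, accelerated by FAISS on GPUs. The sketch below is a minimal illustration of that kind of call, not the authors' code; the feature array, dimensionality, cluster count, and iteration count are illustrative placeholders.

```python
# Minimal sketch (not the paper's code) of GPU-accelerated k-means with FAISS,
# the library the paper names for fast clustering of speech representations.
# Shapes, k, and niter are illustrative placeholders.
import numpy as np
import faiss

feats = np.random.rand(100_000, 512).astype("float32")  # stand-in for wav2vec 2.0 features
kmeans = faiss.Kmeans(d=feats.shape[1], k=128, niter=20, gpu=True)  # gpu=True uses visible GPUs
kmeans.train(feats)
_, cluster_ids = kmeans.index.search(feats, 1)  # nearest centroid per frame
print(cluster_ids.shape)  # (100000, 1)
```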
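
Relatedly, since the Software Dependencies row notes that no versions are pinned, a reproduction would need to record its own environment. A small illustrative way to log the Python-side toolchain is shown below; Kaldi is a C++ toolkit installed separately, so its git commit would have to be noted by hand, and whether a given release of fairseq or faiss exposes `__version__` should be verified locally.

```python
# Log the versions of the Python packages the paper names; useful because the
# paper itself does not pin any. getattr with a default hedges against
# releases that do not expose __version__.
import torch
import fairseq
import faiss

for name, mod in [("torch", torch), ("fairseq", fairseq), ("faiss", faiss)]:
    print(f"{name}: {getattr(mod, '__version__', 'unknown')}")
```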
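
Finally, on the Experiment Setup row: the objective the paper describes augments the GAN loss with a gradient penalty, a segment smoothness penalty, and a phoneme diversity loss, weighted by λ, γ, and η. Below is a hedged PyTorch sketch of what those three terms could look like, reconstructed from the paper's verbal description; the tensor shapes, the toy discriminator, and the weight values are assumptions, not the authors' implementation.

```python
# Hedged sketch of the three wav2vec-U regularizers, reconstructed from the
# paper's verbal description. Not the authors' code; weights are placeholders.
import torch
import torch.nn.functional as F

def gradient_penalty(discriminator, real, fake):
    # Penalize the squared gradient norm of the discriminator on random
    # mixtures of real and generated phoneme sequences (WGAN-GP style).
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)
    mixed = (alpha * real + (1 - alpha) * fake).detach().requires_grad_(True)
    scores = discriminator(mixed)
    (grads,) = torch.autograd.grad(scores.sum(), mixed, create_graph=True)
    return (grads.norm(2, dim=-1) ** 2).mean()

def segment_smoothness(probs):
    # Encourage adjacent segments to emit similar phoneme distributions.
    return ((probs[:, 1:] - probs[:, :-1]) ** 2).sum(-1).mean()

def phoneme_diversity(probs):
    # Negative entropy of the batch-averaged phoneme distribution, pushing
    # the generator to use the whole phoneme inventory.
    mean_probs = probs.mean(dim=(0, 1))
    entropy = -(mean_probs * (mean_probs + 1e-7).log()).sum()
    return -entropy

if __name__ == "__main__":
    B, T, V = 4, 50, 40            # batch, segments, phoneme inventory (toy sizes)
    disc = torch.nn.Sequential(    # stand-in discriminator, not the paper's CNN
        torch.nn.Flatten(1), torch.nn.Linear(T * V, 1))
    real = F.one_hot(torch.randint(V, (B, T)), V).float()
    fake = F.softmax(torch.randn(B, T, V, requires_grad=True), dim=-1)
    lam, gamma, eta = 1.0, 1.0, 1.0  # placeholder weights; the paper's main text omits values
    penalty = (lam * gradient_penalty(disc, real, fake)
               + gamma * segment_smoothness(fake)
               + eta * phoneme_diversity(fake))
    print(float(penalty))
```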