Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction

Authors: Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, Anna Sun

ICLR 2024

Reproducibility variables, extracted results, and the LLM's supporting responses:

Research Type: Experimental
"Experimental results indicate that the proposed model not only achieves more efficient inference but also exhibits superior or comparable performance to the original HuBERT model over various tasks. Specifically, significant performance improvements over the original HuBERT have been observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations using the Speech processing Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB)."

Researcher Affiliation: Collaboration
"Jiatong Shi (1), Hirofumi Inaguma (2), Xutai Ma (2), Ilia Kulikov (2), Anna Sun (2). (1) Language Technologies Institute, Carnegie Mellon University; (2) Meta AI. jiatongs@cs.cmu.edu, {hirofumii, xutaima, kulikov, annaysun}@meta.com"

Pseudocode: No
"The paper does not contain structured pseudocode or algorithm blocks. It describes the architecture and processes in text and diagrams."

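Since the paper offers no pseudocode, the following is a minimal sketch of what a multi-resolution masked-unit-prediction objective could look like in Python. The HuBERT-style cross-entropy over masked frames is standard; how MR-HuBERT actually combines the per-resolution terms is an assumption here, with the weights named after the β/γ loss weights mentioned under Experiment Setup below.

    import torch
    import torch.nn.functional as F

    def masked_unit_loss(logits, targets, mask):
        # HuBERT-style objective: cross-entropy against discrete target
        # units, computed only over the masked frames.
        # logits: (B, T, C); targets: (B, T) long; mask: (B, T) bool.
        return F.cross_entropy(logits[mask], targets[mask])

    def mr_hubert_loss(main, aux_terms, weights=(1.0, 1.0)):
        # Hypothetical combination: the final-resolution loss plus
        # auxiliary losses from lower-resolution predictions, weighted
        # by (beta, gamma). The paper's exact formula may differ.
        loss = masked_unit_loss(*main)
        for w, aux in zip(weights, aux_terms):
            loss = loss + w * masked_unit_loss(*aux)
        return loss
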
Open Source Code: Yes
"We have made the implementation of MR-HuBERT, along with the pre-trained models, available as open-source resources on Fairseq and S3PRL (Ott et al., 2019; Yang et al., 2021). Fairseq: https://github.com/facebookresearch/fairseq/tree/main/examples/mr_hubert; S3PRL: https://s3prl.github.io/s3prl/tutorial/upstream_collection.html#multiresolution-hubert-mr-hubert"

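As a usage note, a model released through S3PRL can typically be loaded with the S3PRLUpstream wrapper. A minimal sketch, assuming the upstream identifier is "multires_hubert_base" (the exact name should be taken from the tutorial page linked above):

    import torch
    from s3prl.nn import S3PRLUpstream

    # "multires_hubert_base" is an assumed identifier; check the S3PRL
    # upstream-collection tutorial for the exact upstream names.
    model = S3PRLUpstream("multires_hubert_base")
    model.eval()

    wavs = torch.randn(2, 16000 * 2)            # two 2-second, 16 kHz waveforms
    wavs_len = torch.LongTensor([32000, 32000])
    with torch.no_grad():
        hidden_states, hs_len = model(wavs, wavs_len)
    print(len(hidden_states), hidden_states[-1].shape)
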
Open Datasets: Yes
"Datasets: We perform pre-training on three corpora: LibriSpeech (Panayotov et al., 2015), Libri-Light (Kahn et al., 2020), and VoxPopuli (Wang et al., 2021a). LibriSpeech and Libri-Light focus exclusively on English, while VoxPopuli is a multilingual dataset encompassing 23 European languages. The total dataset sizes amount to 960 hours for LibriSpeech, 60,000 hours for Libri-Light, and 100,000 hours for VoxPopuli."

Dataset Splits: Yes
"We conduct speech recognition experiments using various subsets of the LibriSpeech corpus for training. Specifically, we fine-tune the SSL models as a whole encoder using 1-hour, 10-hour, and 100-hour training subsets. Subsequently, we evaluate each fine-tuned model on four evaluation sets, namely dev-clean, test-clean, dev-other, and test-other."

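The four evaluation sets are the standard LibriSpeech splits, which torchaudio exposes directly; the 1-hour and 10-hour training subsets are not part of stock LibriSpeech and presumably correspond to the Libri-Light fine-tuning splits. A sketch of loading the evaluation sets:

    import torchaudio

    # Standard LibriSpeech evaluation splits named in the quote above.
    for split in ("dev-clean", "dev-other", "test-clean", "test-other"):
        ds = torchaudio.datasets.LIBRISPEECH("data/", url=split, download=True)
        print(split, len(ds))
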
Hardware Specification: Yes
"All model training was executed on V100-32GB GPUs using the Fairseq toolkit (Ott et al., 2019). ... four models, specifically (B.8)-e through (B.8)-h, are trained on 128 A100-80GB GPUs."

Software Dependencies: No
"The paper mentions the Fairseq toolkit, the Torch Profile toolkit, and S3PRL, but does not specify version numbers for these software components."

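Since no versions are reported, a reproducer would have to record them independently, e.g. with a snippet like the following (the PyPI package names are assumptions for the toolkits mentioned above):

    import importlib.metadata as md

    for pkg in ("torch", "fairseq", "s3prl"):
        try:
            print(f"{pkg}=={md.version(pkg)}")
        except md.PackageNotFoundError:
            print(f"{pkg}: not installed")
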
Experiment Setup: Yes
"Model Configuration: ... the base model uses a four-layer Transformer for each encoder, whereas the large model deploys an eight-layer Transformer for each encoder. ... Pre-trained Models: ... trained on LibriSpeech (960 hours) and Libri-Light (60,000 hours) respectively for 400,000 steps. The multi-base model is trained on VoxPopuli (384,000 hours) for 800,000 steps. More training details are available in Appendix A." Appendix A, Table 5 lists the number of GPUs, frames per batch, gradient accumulation steps, number of training steps, optimizer (AdamW), learning rate, warmup steps, dropout, loss weights (β, γ), and audio normalization.
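For concreteness, the hyperparameter record implied by Appendix A, Table 5 can be captured in a small config object. The field names mirror the table's columns; the example values below are placeholders for illustration, not numbers taken from the paper (apart from the 400,000-step count quoted above).

    from dataclasses import dataclass

    @dataclass
    class PretrainConfig:
        num_gpus: int          # Num. GPU
        frames_per_batch: int  # Num. Frames
        grad_accum: int        # Grad. Accum.
        num_steps: int         # Num. Steps
        optimizer: str         # AdamW in the paper
        learning_rate: float
        warmup_steps: int
        dropout: float
        beta: float            # loss weight β
        gamma: float           # loss weight γ
        audio_norm: bool       # Audio Norm.

    # Placeholder values for illustration only.
    base_cfg = PretrainConfig(
        num_gpus=32, frames_per_batch=1_400_000, grad_accum=1,
        num_steps=400_000, optimizer="adamw", learning_rate=5e-4,
        warmup_steps=32_000, dropout=0.1, beta=1.0, gamma=1.0,
        audio_norm=False,
    )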