UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public Common Voice corpus. The results show that UniSpeech outperforms self-supervised pre-training and supervised transfer learning for speech recognition by up to 13.4% and 26.9% relative phone error rate reduction respectively (averaged over all testing languages).
Researcher Affiliation | Collaboration | Chengyi Wang*1, Yu Wu2, Yao Qian2, Kenichi Kumatani2, Shujie Liu2, Furu Wei2, Michael Zeng2, Xuedong Huang2. *Work done during internship at Microsoft. 1Nankai University, Tianjin, China; 2Microsoft. Correspondence to: Chengyi Wang <cywang@mail.nankai.edu.cn>, Yu Wu <yuwu1@microsoft.com>, Yao Qian <yaoqian@microsoft.com>.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code and model are available at https://github.com/cywang97/unispeech
Open Datasets | Yes | We employ the Common Voice (CV) dataset (Ardila et al., 2019), which is a multilingual corpus of read speech comprising more than 5k hours of speech data in 60 languages. To be comparable with XLSR (Conneau et al., 2020), we consider the following eight languages for evaluation: Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Dutch (nl), Russian (ru), Swedish (sv) and Tatar (tt). English (en) is always regarded as a high-resource language.
Dataset Splits | Yes | For fine-tuning, we use the evaluation splits from Rivière et al. (2020), which contain 1 hour of paired data for training, 20 minutes for validation and 1 hour for testing.
Hardware Specification | No | The paper states that models were trained on 64 GPUs, but it does not specify the GPU type or model (e.g., NVIDIA A100 or Tesla V100).
Software Dependencies | Yes | Models are implemented in fairseq (Ott et al., 2019). We use the Adam optimizer.
Experiment Setup | Yes | To train the UniSpeech model, we use mask probability p = 0.05, loss weight α = 0.5 and replace probability r = 0.5 unless otherwise stated. During pre-training, we crop each utterance to 250k samples for the Base model and 320k samples for the Large model. Each batch on one GPU contains up to 1.4M samples for Base and 1.2M samples for Large. We use the Adam optimizer, where the learning rate is warmed up for the first 10% of updates to a peak of 5e-4 (Base) or 1e-3 (Large) and then linearly decayed over a total of 250k updates. The model is fine-tuned with 2 GPUs. We still use the Adam optimizer; the learning rate is warmed up for 2k updates to 2e-5, kept constant for 8k updates and then linearly decayed for 10k updates. Dropout of 0.1 is always used for both pre-training and fine-tuning.
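
The learning-rate schedules quoted in the Experiment Setup row can be made concrete with a minimal Python sketch. This is not the authors' fairseq configuration; the function names `pretrain_lr` and `finetune_lr` and the per-update granularity are assumptions, while the constants (250k updates with 10% warmup to a 5e-4 Base peak, and the 2k/8k/10k fine-tuning stages at 2e-5) come directly from the paper's quoted setup.

```python
# Hedged sketch of the described LR schedules (not the authors' code).
# Pre-training: linear warmup for the first 10% of updates, then linear decay.
# Fine-tuning: tri-stage schedule (warmup, constant, linear decay).

def pretrain_lr(step, peak_lr=5e-4, total_updates=250_000, warmup_frac=0.10):
    """Warm up to peak_lr over the first 10% of updates, then decay linearly to 0."""
    warmup_updates = int(total_updates * warmup_frac)
    if step < warmup_updates:
        return peak_lr * step / max(1, warmup_updates)
    remaining = total_updates - warmup_updates
    return peak_lr * max(0.0, (total_updates - step) / remaining)

def finetune_lr(step, peak_lr=2e-5, warmup=2_000, constant=8_000, decay=10_000):
    """Warm up for 2k updates, hold for 8k updates, then decay linearly over 10k updates."""
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    if step < warmup + constant:
        return peak_lr
    steps_into_decay = step - warmup - constant
    return peak_lr * max(0.0, 1.0 - steps_into_decay / decay)

if __name__ == "__main__":
    # The 5e-4 peak corresponds to the Base model; the Large model uses 1e-3.
    for step in (0, 25_000, 125_000, 250_000):
        print(f"pretrain step {step:>7}: lr = {pretrain_lr(step):.2e}")
    for step in (0, 2_000, 10_000, 20_000):
        print(f"finetune step {step:>7}: lr = {finetune_lr(step):.2e}")
```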