SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training

Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of SPIRAL, we conduct experiments on LibriSpeech and Libri-Light datasets. By training a small convolutional classifier on the representation of a frozen SPIRAL model, we can achieve WER of 3.5% and 6.4% on LibriSpeech test-clean and test-other respectively. SPIRAL achieves competitive or better results compared to state-of-the-art speech pre-training methods, while being much more training-efficient. We also demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech, with 9.0%-13.3% relative word error rate (WER) reduction on real noisy test data from CHiME-3 (Barker et al., 2015), compared to the model applying multi-condition training solely in the fine-tuning stage. We conduct ablation studies to understand the role of the predictor and projection head in SPIRAL.
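
The frozen-representation result quoted above (a small convolutional classifier trained on the output of a frozen SPIRAL encoder) corresponds to a standard probing setup. The sketch below is only an illustration of that idea; the encoder interface, hidden sizes, vocabulary size, and the CTC objective are assumptions, not the authors' exact configuration.

```python
# Hypothetical frozen-encoder probe: a small conv classifier over SPIRAL
# representations. Dimensions, head design, and the CTC objective are
# illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

class ConvClassifier(nn.Module):
    def __init__(self, rep_dim=768, vocab_size=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(rep_dim, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, vocab_size, kernel_size=1),
        )

    def forward(self, reps):  # reps: (batch, time, rep_dim) from the frozen encoder
        logits = self.head(reps.transpose(1, 2)).transpose(1, 2)  # (batch, time, vocab)
        return logits.log_softmax(dim=-1)

# Typical usage (encoder frozen, only the head is trained):
# for p in encoder.parameters():
#     p.requires_grad = False
# log_probs = ConvClassifier()(encoder(features))
# loss = nn.CTCLoss()(log_probs.transpose(0, 1), targets, input_lens, target_lens)
```
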
Researcher Affiliation | Industry | Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu, Huawei Noah's Ark Lab {wenyong.huang,zhangzhenhe1,yeung.yu.ting}@huawei.com {jiang.xin,qun.liu}@huawei.com
Pseudocode | No | The paper contains a diagram (Figure 1) illustrating the architecture, but no formal pseudocode or algorithm blocks.
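
Since no pseudocode is given, the following is a rough, hypothetical sketch of a perturbation-invariant teacher-student training step of the kind the paper describes: the student consumes a perturbed view, the teacher consumes the clean input, and the teacher's weights track an exponential moving average of the student's. The `perturb` and `predictor` modules and the regression loss are placeholders, not the released implementation.

```python
# Hypothetical sketch of a perturbation-invariant teacher-student step.
# Module names (student, teacher, predictor, perturb) and the distance
# loss are placeholders inferred from the paper's description.
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha):
    """Teacher weights track an exponential moving average of the student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def train_step(student, teacher, predictor, perturb, feats, optimizer, alpha):
    # Student sees a perturbed view; teacher sees the clean features.
    student_out = predictor(student(perturb(feats)))
    with torch.no_grad():  # teacher receives no gradients
        target = teacher(feats)
    # Regression-style distance between student prediction and teacher target
    # (the actual objective in the paper may differ).
    loss = F.mse_loss(student_out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student, alpha)
    return loss.item()
```
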
Open Source Code | Yes | Source code is available: https://github.com/huawei-noah/Speech-Backbones/tree/main/SPIRAL
Open Datasets | Yes | For pre-training, we use the 960-hour training data (ignoring the labels) from LibriSpeech (Panayotov et al., 2015) (LS-960), or 60k-hour unlabeled audio data from Libri-Light (Kahn et al., 2020b) (LL-60K). The two datasets are both derived from English audiobooks from the LibriVox project (https://librivox.org/).
Dataset Splits | Yes | For ASR fine-tuning, we apply the 100-hour subset (train-clean-100) as low-resource labeled data and the entire LS-960 with labels as high-resource labeled data, both from LibriSpeech. For multi-condition training, we use the noise dataset from Reddy et al. (2021). The dataset consists of 181 hours of noise data with about 150 noise types and 70,000 clips. We shuffle and split the noise data with a ratio of 8:1:1, used for training, synthesizing noisy dev sets, and synthesizing noisy test sets (results in Appendix A.2) respectively.
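
The 8:1:1 shuffle-and-split of the noise clips is simple to reproduce. A minimal sketch follows, where the directory layout, file naming, and random seed are assumptions rather than details from the released code.

```python
# Hypothetical 8:1:1 split of noise clips into train / dev-noise / test-noise.
# The "noise_data" directory layout and the seed are assumptions for illustration.
import random
from pathlib import Path

clips = sorted(Path("noise_data").glob("**/*.wav"))  # assumed layout
random.Random(0).shuffle(clips)

n = len(clips)
n_train, n_dev = int(0.8 * n), int(0.1 * n)
splits = {
    "train": clips[:n_train],
    "dev":   clips[n_train:n_train + n_dev],
    "test":  clips[n_train + n_dev:],
}
for name, subset in splits.items():
    with open(f"noise_{name}.lst", "w") as f:
        f.write("\n".join(str(p) for p in subset))
```
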
Hardware Specification | Yes | We train the BASE model with a batch size of 24 per GPU for 200k steps on 16 V100 GPUs, which takes about 1.3 days. For the LARGE model, we train with a batch size of 20 per GPU for 500k steps on 32 V100 GPUs, which takes about 7.25 days.
Software Dependencies | No | The paper does not explicitly state version numbers for software dependencies such as Python, PyTorch, or other libraries used for the implementation.
Experiment Setup | Yes | We apply 128-dimensional log-mel filterbank features, extracted with a 20 ms window and 10 ms stride, as the input acoustic feature. In pre-training, we optimize with the Adam (Kingma & Ba, 2015) optimizer, warming up the learning rate for the first 8% of updates to a peak of 3e-3; the learning rate then decays to 0 with a cosine schedule. The moving average update rate αt of the teacher's weights also follows a cosine schedule (Grill et al., 2020): we increase αt from 0.995 to 1.0 and from 0.990 to 0.999 for the BASE and LARGE models respectively. We train the BASE model with a batch size of 24 per GPU for 200k steps on 16 V100 GPUs, which takes about 1.3 days.
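
The learning-rate and EMA schedules are fully specified in the quote above (8% linear warmup to a 3e-3 peak, cosine decay to 0; αt ramped from 0.995 to 1.0 for BASE). Below is a small sketch of both schedules, assuming the cosine decay spans the post-warmup steps and the EMA ramp spans the whole run; the exact phase handling in the paper may differ.

```python
# Sketch of the schedules described above. Assumes the cosine decay covers
# the steps after warmup and the EMA ramp covers the full run.
import math

TOTAL_STEPS = 200_000                      # BASE model
WARMUP_STEPS = int(0.08 * TOTAL_STEPS)     # first 8% of updates
PEAK_LR = 3e-3
ALPHA_START, ALPHA_END = 0.995, 1.0        # teacher EMA rate for BASE

def learning_rate(step):
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS                      # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

def ema_rate(step):
    progress = step / TOTAL_STEPS
    # cosine ramp from ALPHA_START at step 0 up to ALPHA_END at the final step
    return ALPHA_END - (ALPHA_END - ALPHA_START) * 0.5 * (1.0 + math.cos(math.pi * progress))
```
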