SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training

Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of SPIRAL, we conduct experiments on LibriSpeech and Libri-Light datasets. By training a small convolutional classifier on the representation of a frozen SPIRAL model, we can achieve WER of 3.5% and 6.4% on LibriSpeech test-clean and test-other respectively. SPIRAL achieves competitive or better results compared to state-of-the-art speech pre-training methods, while being much more training-efficient. We also demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech, with 9.0%-13.3% relative word error rate (WER) reduction on real noisy test data from CHiME-3 (Barker et al., 2015), compared to the model applying multi-condition training solely in the fine-tuning stage. We conduct ablation studies to understand the role of the predictor and projection head in SPIRAL.
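
The frozen-representation result quoted above (a small convolutional classifier trained on the output of a frozen SPIRAL encoder) corresponds to a standard probing setup. The sketch below is only an illustration of that idea; the encoder interface, hidden sizes, vocabulary size, and the CTC objective are assumptions, not the authors' exact configuration.

```python
# Hypothetical frozen-encoder probe: a small conv classifier over SPIRAL
# representations. Dimensions, head design, and the CTC objective are
# illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

class ConvClassifier(nn.Module):
    def __init__(self, rep_dim=768, vocab_size=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(rep_dim, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, vocab_size, kernel_size=1),
        )

    def forward(self, reps):  # reps: (batch, time, rep_dim) from the frozen encoder
        logits = self.head(reps.transpose(1, 2)).transpose(1, 2)  # (batch, time, vocab)
        return logits.log_softmax(dim=-1)

# Typical usage (encoder frozen, only the head is trained):
# for p in encoder.parameters():
#     p.requires_grad = False
# log_probs = ConvClassifier()(encoder(features))
# loss = nn.CTCLoss()(log_probs.transpose(0, 1), targets, input_lens, target_lens)
```
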
Researcher Affiliation | Industry | Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu, Huawei Noah's Ark Lab {wenyong.huang,zhangzhenhe1,yeung.yu.ting}@huawei.com {jiang.xin,qun.liu}@huawei.com
Pseudocode | No | The paper contains a diagram (Figure 1) illustrating the architecture, but no formal pseudocode or algorithm blocks.
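
Since no pseudocode is given, the following is a rough, hypothetical sketch of a perturbation-invariant teacher-student training step of the kind the paper describes: the student consumes a perturbed view, the teacher consumes the clean input, and the teacher's weights track an exponential moving average of the student's. The `perturb` and `predictor` modules and the regression loss are placeholders, not the released implementation.

```python
# Hypothetical sketch of a perturbation-invariant teacher-student step.
# Module names (student, teacher, predictor, perturb) and the distance
# loss are placeholders inferred from the paper's description.
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha):
    """Teacher weights track an exponential moving average of the student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def train_step(student, teacher, predictor, perturb, feats, optimizer, alpha):
    # Student sees a perturbed view; teacher sees the clean features.
    student_out = predictor(student(perturb(feats)))
    with torch.no_grad():  # teacher receives no gradients
        target = teacher(feats)
    # Regression-style distance between student prediction and teacher target
    # (the actual objective in the paper may differ).
    loss = F.mse_loss(student_out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student, alpha)
    return loss.item()
```
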
Open Source Code | Yes | Source code is available: https://github.com/huawei-noah/Speech-Backbones/tree/main/SPIRAL
Open Datasets | Yes | For pre-training, we use the 960-hour training data (ignoring the labels) from LibriSpeech (Panayotov et al., 2015) (LS-960), or 60k-hour unlabeled audio data from Libri-Light (Kahn et al., 2020b) (LL-60K). The two datasets are both derived from English audiobooks from the LibriVox project (https://librivox.org/).
Dataset Splits | Yes | For ASR fine-tuning, we apply the 100-hour subset (train-clean-100) as low-resource labeled data and the entire LS-960 with labels as high-resource labeled data, both from LibriSpeech. For multi-condition training, we use the noise dataset from Reddy et al. (2021). The dataset consists of 181 hours of noise data with about 150 noise types and 70,000 clips. We shuffle and split the noise data with a ratio of 8:1:1, used for training, synthesizing noisy dev sets, and synthesizing noisy test sets (results in Appendix A.2) respectively.
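
The 8:1:1 shuffle-and-split of the noise clips is simple to reproduce. A minimal sketch follows, where the directory layout, file naming, and random seed are assumptions rather than details from the released code.

```python
# Hypothetical 8:1:1 split of noise clips into train / dev-noise / test-noise.
# The "noise_data" directory layout and the seed are assumptions for illustration.
import random
from pathlib import Path

clips = sorted(Path("noise_data").glob("**/*.wav"))  # assumed layout
random.Random(0).shuffle(clips)

n = len(clips)
n_train, n_dev = int(0.8 * n), int(0.1 * n)
splits = {
    "train": clips[:n_train],
    "dev":   clips[n_train:n_train + n_dev],
    "test":  clips[n_train + n_dev:],
}
for name, subset in splits.items():
    with open(f"noise_{name}.lst", "w") as f:
        f.write("\n".join(str(p) for p in subset))
```
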
Hardware Specification | Yes | We train the BASE model with a batch size of 24 per GPU for 200k steps on 16 V100 GPUs, which takes about 1.3 days. For the LARGE model, we train with a batch size of 20 per GPU for 500k steps on 32 V100 GPUs, which takes about 7.25 days.
Software Dependencies | No | The paper does not explicitly state version numbers for software dependencies such as Python, PyTorch, or other libraries used for the implementation.
Experiment Setup | Yes | We apply 128-dimensional log-mel filterbank features, extracted with a 20 ms window and 10 ms stride, as the input acoustic feature. In pre-training, we optimize with the Adam (Kingma & Ba, 2015) optimizer, warming up the learning rate for the first 8% of updates to a peak of 3e-3; the learning rate then decays to 0 with a cosine schedule. The moving average update rate αt of the teacher's weights also follows a cosine schedule (Grill et al., 2020): we increase αt from 0.995 to 1.0 and from 0.990 to 0.999 for the BASE and LARGE models respectively. We train the BASE model with a batch size of 24 per GPU for 200k steps on 16 V100 GPUs, which takes about 1.3 days.
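
The learning-rate and EMA schedules are fully specified in the quote above (8% linear warmup to a 3e-3 peak, cosine decay to 0; αt ramped from 0.995 to 1.0 for BASE). Below is a small sketch of both schedules, assuming the cosine decay spans the post-warmup steps and the EMA ramp spans the whole run; the exact phase handling in the paper may differ.

```python
# Sketch of the schedules described above. Assumes the cosine decay covers
# the steps after warmup and the EMA ramp covers the full run.
import math

TOTAL_STEPS = 200_000                      # BASE model
WARMUP_STEPS = int(0.08 * TOTAL_STEPS)     # first 8% of updates
PEAK_LR = 3e-3
ALPHA_START, ALPHA_END = 0.995, 1.0        # teacher EMA rate for BASE

def learning_rate(step):
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS                      # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

def ema_rate(step):
    progress = step / TOTAL_STEPS
    # cosine ramp from ALPHA_START at step 0 up to ALPHA_END at the final step
    return ALPHA_END - (ALPHA_END - ALPHA_START) * 0.5 * (1.0 + math.cos(math.pi * progress))
```
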