Self-supervised learning with random-projection quantizer for speech recognition
Authors: Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT. |
| Researcher Affiliation | Industry | Google Research, Brain Team. Correspondence to: Chung-Cheng Chiu <chungchengc@google.com>, James Qin <jamesqin@google.com>, Yu Zhang <ngyuzh@google.com>. |
| Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found in the paper. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on the Libri-Light dataset (Kahn et al., 2020) for pre-training, and fine-tune on the LibriSpeech training set which contains 960 hours of data. |
| Dataset Splits | No | The paper uses standard evaluation splits like 'dev', 'dev-other', 'test', and 'test-other' for LibriSpeech in its results tables. However, it does not explicitly state the specific percentages or absolute counts for how the training, validation, and test datasets are partitioned, beyond mentioning the 'LibriSpeech training set which contains 960 hours of data'. |
| Hardware Specification | No | No specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or TPU versions used for running experiments are provided in the paper. It only mentions using the Lingvo library. |
| Software Dependencies | No | The implementation uses the Lingvo (Shen et al., 2019) library. No specific version numbers for Lingvo or any other software dependencies are provided. |
| Experiment Setup | Yes | Pre-train. The pre-training uses a mask length of 400 ms with a masking probability of 0.01. The learning rate follows a transformer learning rate schedule (Vaswani et al., 2017). The model is trained with the Adam optimizer (Kingma & Ba, 2015) with a 0.004 peak learning rate and 25000 warmup steps. The batch size is 2048. Fine-tune. The encoder uses a 0.0003 peak learning rate and 5000 warmup steps, while the decoder uses a 0.001 peak learning rate and 1500 warmup steps. The streaming pre-training uses the same setup as the original architecture, except that the mask length is 300 ms and the masking probability is 0.02. A hedged configuration sketch based on these values follows the table. |
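
The snippet below is a minimal sketch that collects the reported pre-training hyperparameters and the transformer learning-rate schedule (Vaswani et al., 2017), rescaled so its maximum equals the stated peak learning rate. Since the paper releases no code, the class and function names (`PretrainConfig`, `transformer_lr`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: hyperparameter values are quoted from the paper's
# "Experiment Setup" description; the names and structure are assumptions.
from dataclasses import dataclass


@dataclass
class PretrainConfig:
    mask_length_ms: int = 400        # 300 ms for the streaming variant
    mask_probability: float = 0.01   # 0.02 for the streaming variant
    peak_learning_rate: float = 0.004
    warmup_steps: int = 25_000
    batch_size: int = 2048
    # Fine-tuning (reported separately): encoder peak LR 0.0003 / 5000 warmup,
    # decoder peak LR 0.001 / 1500 warmup.


def transformer_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Transformer schedule rescaled to reach peak_lr at step == warmup_steps.

    The standard schedule is proportional to
    min(step**-0.5, step * warmup_steps**-1.5): linear warmup followed by
    inverse-square-root decay.
    """
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


if __name__ == "__main__":
    cfg = PretrainConfig()
    for s in (1_000, 25_000, 100_000):
        print(s, transformer_lr(s, cfg.peak_learning_rate, cfg.warmup_steps))
```

At step 25000 the schedule reaches its 0.004 peak and then decays as the inverse square root of the step count; the fine-tuning encoder and decoder would use the same shape with their own peak/warmup values.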