Self-supervised learning with random-projection quantizer for speech recognition
Authors: Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT. |
| Researcher Affiliation | Industry | Google Research, Brain Team. Correspondence to: Chung-Cheng Chiu <chungchengc@google.com>, James Qin <jamesqin@google.com>, Yu Zhang <ngyuzh@google.com>. |
| Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found in the paper. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on the Libri-Light dataset (Kahn et al., 2020) for pre-training, and fine-tune on the LibriSpeech training set which contains 960 hours of data. |
| Dataset Splits | No | The paper uses standard evaluation splits like 'dev', 'dev-other', 'test', and 'test-other' for LibriSpeech in its results tables. However, it does not explicitly state the specific percentages or absolute counts for how the training, validation, and test datasets are partitioned, beyond mentioning the 'LibriSpeech training set which contains 960 hours of data'. |
| Hardware Specification | No | No specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or TPU versions used for running experiments are provided in the paper. It only mentions using the Lingvo library. |
| Software Dependencies | No | The implementation uses the Lingvo (Shen et al., 2019) library. No specific version numbers for Lingvo or any other software dependencies are provided. |
| Experiment Setup | Yes | Pre-train. The pre-training uses a mask length of 400 ms with a masking probability of 0.01. The learning rate follows a transformer learning rate schedule (Vaswani et al., 2017). The model is trained with the Adam optimizer (Kingma & Ba, 2015) with a 0.004 peak learning rate and 25000 warmup steps. The batch size is 2048. Fine-tune. The encoder uses a 0.0003 peak learning rate and 5000 warmup steps, while the decoder uses a 0.001 peak learning rate and 1500 warmup steps. The streaming pre-training uses the same setup as the original architecture, except that the mask length is 300 ms and the masking probability is 0.02. A hedged configuration sketch based on these values follows the table. |
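
The snippet below is a minimal sketch that collects the reported pre-training hyperparameters and the transformer learning-rate schedule (Vaswani et al., 2017), rescaled so its maximum equals the stated peak learning rate. Since the paper releases no code, the class and function names (`PretrainConfig`, `transformer_lr`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: hyperparameter values are quoted from the paper's
# "Experiment Setup" description; the names and structure are assumptions.
from dataclasses import dataclass


@dataclass
class PretrainConfig:
    mask_length_ms: int = 400        # 300 ms for the streaming variant
    mask_probability: float = 0.01   # 0.02 for the streaming variant
    peak_learning_rate: float = 0.004
    warmup_steps: int = 25_000
    batch_size: int = 2048
    # Fine-tuning (reported separately): encoder peak LR 0.0003 / 5000 warmup,
    # decoder peak LR 0.001 / 1500 warmup.


def transformer_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Transformer schedule rescaled to reach peak_lr at step == warmup_steps.

    The standard schedule is proportional to
    min(step**-0.5, step * warmup_steps**-1.5): linear warmup followed by
    inverse-square-root decay.
    """
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


if __name__ == "__main__":
    cfg = PretrainConfig()
    for s in (1_000, 25_000, 100_000):
        print(s, transformer_lr(s, cfg.peak_learning_rate, cfg.warmup_steps))
```

At step 25000 the schedule reaches its 0.004 peak and then decays as the inverse square root of the step count; the fine-tuning encoder and decoder would use the same shape with their own peak/warmup values.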