wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Authors: Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets.
Researcher Affiliation | Industry | Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli ({abaevski,henryzhou7,abdo,michaelauli}@fb.com), Facebook AI
Pseudocode | No | The paper describes its masking strategy, objective function, and fine-tuning process using textual descriptions and mathematical formulas (e.g., L = L_m + α L_d and L_m = −log [ exp(sim(c_t, q_t)/κ) / Σ_{q̃∼Q_t} exp(sim(c_t, q̃)/κ) ]). However, it does not include any structured pseudocode or algorithm blocks (see the loss sketch after the table).
Open Source Code | Yes | Code and models are available at https://github.com/pytorch/fairseq (see the loading sketch after the table).
Open Datasets | Yes | As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k)... We fine-tune the pre-trained models for phoneme recognition on the TIMIT dataset [13].
Dataset Splits | Yes | We fine-tune on five labeled data settings: 960 hours of transcribed Librispeech, the train-clean-100 subset comprising 100 hours (100 hours labeled), as well as the Libri-light limited resource training subsets originally extracted from Librispeech, these are train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)... We use the standard train, dev and test split [for TIMIT] (see the dataset sketch after the table).
Hardware Specification | Yes | Batches are built by cropping 250k audio samples, or 15.6sec, from each example. Crops are batched together to not exceed 1.4m samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [38]... The LARGE model... train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox.
Software Dependencies | No | Models are implemented in fairseq [39]... We optimize with Adam [29]. No specific version numbers for these software dependencies are provided in the text.
Experiment Setup | Yes | For masking, we sample p = 0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps... BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads... We optimize with Adam [29], warming up the learning rate for the first 8% of updates to a peak of 5 × 10⁻⁴ for BASE and 3 × 10⁻⁴ for LARGE... We use weight α = 0.1 for the diversity loss Equation 2. For the quantization module we use G = 2 and V = 320 for both models... The Gumbel softmax temperature τ is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ = 0.1 (see the masking and schedule sketches after the table).
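
The Pseudocode row quotes the objective only as formulas, so here is a minimal PyTorch sketch of L = L_m + α L_d: a contrastive term over quantized targets plus a weighted diversity term. Tensor names, shapes, the distractor count K, and the small epsilon are illustrative assumptions; this is not the fairseq implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, kappa=0.1):
    """L_m: pick the true quantized latent q_t among K distractors.

    context:     (T, D)    transformer outputs c_t at masked time-steps
    targets:     (T, D)    quantized latents q_t at the same time-steps
    distractors: (T, K, D) quantized latents sampled from other masked steps
    """
    candidates = torch.cat([targets.unsqueeze(1), distractors], dim=1)        # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1).expand_as(candidates),
                               candidates, dim=-1) / kappa                    # (T, K+1)
    labels = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
    return F.cross_entropy(sims, labels)  # the true target sits at index 0

def diversity_loss(avg_probs, eps=1e-7):
    """L_d: encourage equal use of the V entries in each of the G codebooks.

    avg_probs: (G, V) softmax over codebook entries, averaged over a batch.
    """
    G, V = avg_probs.shape
    entropy = -(avg_probs * torch.log(avg_probs + eps)).sum(dim=-1)           # H per codebook
    return -entropy.sum() / (G * V)  # minimizing this maximizes codebook entropy

def total_loss(context, targets, distractors, avg_probs, alpha=0.1):
    # L = L_m + alpha * L_d, with alpha = 0.1 as quoted in the Experiment Setup row
    return contrastive_loss(context, targets, distractors) + alpha * diversity_loss(avg_probs)
```

Distractors are sampled uniformly from other masked time-steps of the same utterance; that sampling is left outside the sketch.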
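
The Open Source Code row points to the fairseq repository for code and released checkpoints. As a hedged illustration of running one of the released models, the sketch below uses torchaudio's pipeline bundles, which repackage wav2vec 2.0 weights; this reflects an assumption about the reader's environment rather than the authors' own recipe, and the audio file name is a placeholder.

```python
import torch
import torchaudio

# Pretrained + fine-tuned BASE model (960 h of labeled Librispeech) via torchaudio.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# Placeholder path to a mono audio file; resample to the bundle's 16 kHz rate if needed.
waveform, sample_rate = torchaudio.load("sample.flac")
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    emissions, _ = model(waveform)  # (batch, frames, vocab) log-probabilities over characters

# Greedy CTC decoding: collapse repeats, drop blanks ('-'), map '|' to spaces
# (assumes the bundle's label set uses '-' as blank and '|' as word boundary).
labels = bundle.get_labels()
tokens = torch.unique_consecutive(emissions[0].argmax(dim=-1))
transcript = "".join(labels[int(t)] for t in tokens if labels[int(t)] != "-")
print(transcript.replace("|", " "))
```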
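
The Dataset Splits row lists the labeled Librispeech settings and the Libri-light limited-resource subsets. As a convenience sketch (an assumption about tooling, not the authors' pipeline), the standard Librispeech splits can be materialized with torchaudio; the Libri-light 10h/1h/10min subsets and TIMIT are distributed separately and are not covered by this loader.

```python
import torchaudio

root = "./data"  # placeholder download directory

# Standard Librispeech splits used for the 100 h labeled setting and evaluation.
train_100h = torchaudio.datasets.LIBRISPEECH(root, url="train-clean-100", download=True)
dev_clean  = torchaudio.datasets.LIBRISPEECH(root, url="dev-clean", download=True)
dev_other  = torchaudio.datasets.LIBRISPEECH(root, url="dev-other", download=True)
test_clean = torchaudio.datasets.LIBRISPEECH(root, url="test-clean", download=True)
test_other = torchaudio.datasets.LIBRISPEECH(root, url="test-other", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_100h[0]
print(sample_rate, transcript)
```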
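
The Experiment Setup row quotes the span-masking parameters (p = 0.065 starting indices, span length M = 10). A simplified NumPy sketch of that sampling follows; the fairseq implementation adds further constraints (e.g. guaranteeing a minimum number of spans), so treat this as an approximation.

```python
import numpy as np

def sample_mask(num_steps, p=0.065, mask_length=10, rng=None):
    """Sample a boolean mask over latent time-steps.

    Each step becomes a span start with probability p; the span covers the next
    `mask_length` steps, and overlapping spans simply merge.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(num_steps, dtype=bool)
    for start in np.nonzero(rng.random(num_steps) < p)[0]:
        mask[start:start + mask_length] = True
    return mask

# The feature encoder emits roughly 50 frames per second (about 20 ms stride), so a
# 10-second crop gives ~500 steps; with p = 0.065 and M = 10 roughly half of the
# steps end up masked on average once overlaps are accounted for.
print(sample_mask(500).mean())
```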
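
The same row also quotes the optimization and temperature schedules: Adam with a learning-rate warm-up over the first 8% of updates to a peak of 5 × 10⁻⁴ (BASE), and a Gumbel-softmax temperature annealed from 2 by a factor of 0.999995 per update down to a floor of 0.5. A small sketch of both schedules follows; the total update count and the linear decay after warm-up are assumptions based on the paper, not values quoted in the table.

```python
def learning_rate(step, total_updates=400_000, peak_lr=5e-4, warmup_frac=0.08):
    """Linear warm-up to peak_lr over the first 8% of updates, then linear decay.

    total_updates is a placeholder; the decay shape after warm-up is assumed
    to be linear, following the paper's description.
    """
    warmup = int(total_updates * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total_updates - step) / (total_updates - warmup)

def gumbel_temperature(step, start=2.0, floor=0.5, decay=0.999995):
    """Multiplicative anneal of the Gumbel-softmax temperature with a floor."""
    return max(floor, start * decay ** step)

# e.g. after 100k updates the temperature has decayed to about 2 * 0.999995**100000 ≈ 1.21
print(gumbel_temperature(100_000))
```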