wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. |
| Researcher Affiliation | Industry | Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli ({abaevski,henryzhou7,abdo,michaelauli}@fb.com), Facebook AI |
| Pseudocode | No | The paper describes its masking strategy, objective function, and fine-tuning process with textual descriptions and mathematical formulas, e.g. the combined loss L = L_m + α·L_d and the contrastive loss L_m = −log( exp(sim(c_t, q_t)/κ) / Σ_{q̃∼Q_t} exp(sim(c_t, q̃)/κ) ). It does not include any structured pseudocode or algorithm blocks. (A code sketch of this objective follows the table.) |
| Open Source Code | Yes | Code and models are available at https://github.com/pytorch/fairseq (a model-loading sketch follows the table). |
| Open Datasets | Yes | As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k)... We fine-tune the pre-trained models for phoneme recognition on the TIMIT dataset [13]. |
| Dataset Splits | Yes | We fine-tune on five labeled data settings: 960 hours of transcribed Librispeech, the train-clean-100 subset comprising 100 hours (100 hours labeled), as well as the Libri-light limited resource training subsets originally extracted from Librispeech, these are train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)... We use the standard train, dev and test split [for TIMIT]. |
| Hardware Specification | Yes | Batches are built by cropping 250k audio samples, or 15.6 sec, from each example. Crops are batched together to not exceed 1.4m samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [38]... The LARGE model... train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox. (The crop and batch arithmetic is worked through after the table.) |
| Software Dependencies | No | Models are implemented in fairseq [39]... We optimize with Adam [29]. No specific version numbers for these software dependencies are provided in the text. |
| Experiment Setup | Yes | For masking, we sample p = 0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps... BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads... We optimize with Adam [29], warming up the learning rate for the first 8% of updates to a peak of 5 × 10−4 for BASE and 3 × 10−4 for LARGE... We use α = 0.1 as the weight for the diversity loss (Equation 2). For the quantization module we use G = 2 and V = 320 for both models... The Gumbel softmax temperature is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ = 0.1. (A masking and hyperparameter sketch follows the table.) |
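
Since the "Pseudocode" row notes that the objective is only given as formulas, here is a minimal PyTorch sketch of that objective, L = L_m + α·L_d, under assumed tensor shapes. The function names, shapes, and the way distractors are supplied are illustrative assumptions, not the fairseq implementation.

```python
# Minimal sketch of the wav2vec 2.0 objective L = L_m + alpha * L_d, written
# against the formulas quoted above. Shapes and argument conventions are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(c, q_pos, q_neg, kappa=0.1):
    """L_m: identify the true quantized latent among K distractors.

    c:     (B, T, D)    context network outputs at masked time-steps
    q_pos: (B, T, D)    true quantized latents for those time-steps
    q_neg: (B, T, K, D) K distractors drawn from other masked time-steps
    """
    pos = F.cosine_similarity(c, q_pos, dim=-1).unsqueeze(-1)                 # (B, T, 1)
    neg = F.cosine_similarity(c.unsqueeze(2).expand_as(q_neg), q_neg, dim=-1) # (B, T, K)
    logits = torch.cat([pos, neg], dim=-1) / kappa                            # (B, T, 1 + K)
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long)                # positive sits at index 0
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def diversity_loss(probs):
    """L_d: encourage equal use of the V entries in each of the G codebooks
    by maximising the entropy of the batch-averaged code distribution.

    probs: (N, G, V) softmax distributions over codebook entries
    """
    avg = probs.mean(dim=0)                               # (G, V)
    G, V = avg.shape
    return (avg * torch.log(avg + 1e-7)).sum() / (G * V)  # negative entropy, averaged

def wav2vec2_objective(c, q_pos, q_neg, probs, alpha=0.1, kappa=0.1):
    """Combined objective L = L_m + alpha * L_d with the paper's alpha = 0.1."""
    return contrastive_loss(c, q_pos, q_neg, kappa) + alpha * diversity_loss(probs)

# Tiny shape check with random tensors.
B, T, K, D, G, V = 2, 4, 3, 8, 2, 320
loss = wav2vec2_objective(
    torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, K, D),
    torch.softmax(torch.randn(B * T, G, V), dim=-1),
)
print(loss.item())
```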
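
For the released code and models, the snippet below sketches how a downloaded checkpoint is typically loaded via fairseq's checkpoint utilities, following the pattern shown in the repository's wav2vec README. The checkpoint path is a placeholder and the snippet is a sketch rather than a verified recipe.

```python
# Sketch: loading a released wav2vec 2.0 checkpoint with fairseq's checkpoint utilities.
# Requires `pip install fairseq` and a checkpoint downloaded from the repository;
# the path below is a placeholder.
import fairseq

ckpt_path = "/path/to/wav2vec_small.pt"  # placeholder path to a downloaded checkpoint
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()
print(type(model).__name__)  # pre-trained model, ready for feature extraction or fine-tuning
```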
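
The crop length quoted in the "Hardware Specification" row follows directly from Librispeech's 16 kHz sampling rate; the short calculation below makes that arithmetic explicit (variable names are illustrative).

```python
# Worked arithmetic for the batching numbers quoted above.
SAMPLE_RATE = 16_000        # Hz, Librispeech audio
CROP_SAMPLES = 250_000      # samples cropped from each example
PER_GPU_BUDGET = 1_400_000  # maximum samples per GPU per batch

crop_seconds = CROP_SAMPLES / SAMPLE_RATE        # 15.625 s, reported as 15.6 sec
crops_per_gpu = PER_GPU_BUDGET // CROP_SAMPLES   # at most 5 full-length crops per GPU
print(f"{crop_seconds:.3f} s per crop, at most {crops_per_gpu} full-length crops per GPU")
```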
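
The masking procedure and the hyperparameters in the "Experiment Setup" row can be pulled together as follows. The sampling routine and the config dictionary are illustrative reconstructions of the quoted values, not the fairseq code.

```python
# Sketch of the span masking described above: each time-step becomes a span start
# with probability p, and the subsequent M time-steps are masked (spans may overlap).
import numpy as np

def sample_mask(num_frames, p=0.065, M=10, seed=0):
    rng = np.random.default_rng(seed)
    starts = rng.random(num_frames) < p      # sample span starting indices
    mask = np.zeros(num_frames, dtype=bool)
    for t in np.flatnonzero(starts):
        mask[t:t + M] = True                 # mask the M subsequent time-steps
    return mask

mask = sample_mask(1000)
print(mask.mean())  # expected coverage 1 - (1 - p)^M ~ 0.49, i.e. the ~49% masking the paper reports

# Hyperparameters quoted above, gathered into one dictionary for reference
# (BASE configuration; key names are illustrative).
BASE_CONFIG = {
    "transformer_blocks": 12,
    "model_dim": 768,
    "ffn_dim": 3072,
    "attention_heads": 8,
    "optimizer": "Adam",
    "peak_lr": 5e-4,              # 3e-4 for LARGE
    "lr_warmup_fraction": 0.08,   # first 8% of updates
    "mask_prob_p": 0.065,
    "mask_span_M": 10,
    "diversity_weight_alpha": 0.1,
    "codebooks_G": 2,
    "entries_per_codebook_V": 320,
    "gumbel_temp_start": 2.0,
    "gumbel_temp_min": 0.5,       # 0.1 for LARGE
    "gumbel_temp_decay": 0.999995,
    "contrastive_temp_kappa": 0.1,
}
```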