wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Authors: Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets.
Researcher Affiliation | Industry | Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli ({abaevski,henryzhou7,abdo,michaelauli}@fb.com), Facebook AI
Pseudocode | No | The paper describes its masking strategy, objective function, and fine-tuning process using textual descriptions and mathematical formulas (e.g., L = L_m + α L_d and L_m = −log [ exp(sim(c_t, q_t)/κ) / Σ_{q̃∼Q_t} exp(sim(c_t, q̃)/κ) ]). However, it does not include any structured pseudocode or algorithm blocks (see the loss sketch after the table).
Open Source Code | Yes | Code and models are available at https://github.com/pytorch/fairseq (see the loading sketch after the table).
Open Datasets | Yes | As unlabeled data we consider the Librispeech corpus [40] without transcriptions containing 960 hours of audio (LS-960) or the audio data from LibriVox (LV-60k)... We fine-tune the pre-trained models for phoneme recognition on the TIMIT dataset [13].
Dataset Splits | Yes | We fine-tune on five labeled data settings: 960 hours of transcribed Librispeech, the train-clean-100 subset comprising 100 hours (100 hours labeled), as well as the Libri-light limited resource training subsets originally extracted from Librispeech, these are train-10h (10 hours labeled), train-1h (1 hour labeled), train-10min (10 min labeled)... We use the standard train, dev and test split [for TIMIT] (see the dataset sketch after the table).
Hardware Specification | Yes | Batches are built by cropping 250k audio samples, or 15.6sec, from each example. Crops are batched together to not exceed 1.4m samples per GPU and we train on a total of 64 V100 GPUs for 1.6 days [38]... The LARGE model... train on 128 V100 GPUs over 2.3 days for Librispeech and 5.2 days for LibriVox.
Software Dependencies | No | Models are implemented in fairseq [39]... We optimize with Adam [29]. No specific version numbers for these software dependencies are provided in the text.
Experiment Setup | Yes | For masking, we sample p = 0.065 of all time-steps to be starting indices and mask the subsequent M = 10 time-steps... BASE contains 12 transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads... We optimize with Adam [29], warming up the learning rate for the first 8% of updates to a peak of 5 × 10⁻⁴ for BASE and 3 × 10⁻⁴ for LARGE... We use weight α = 0.1 for the diversity loss Equation 2. For the quantization module we use G = 2 and V = 320 for both models... The Gumbel softmax temperature τ is annealed from 2 to a minimum of 0.5 for BASE and 0.1 for LARGE by a factor of 0.999995 at every update. The temperature in the contrastive loss (Equation 3) is set to κ = 0.1 (see the masking and schedule sketches after the table).
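
The Pseudocode row quotes the objective only as formulas, so here is a minimal PyTorch sketch of L = L_m + α L_d: a contrastive term over quantized targets plus a weighted diversity term. Tensor names, shapes, the distractor count K, and the small epsilon are illustrative assumptions; this is not the fairseq implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, kappa=0.1):
    """L_m: pick the true quantized latent q_t among K distractors.

    context:     (T, D)    transformer outputs c_t at masked time-steps
    targets:     (T, D)    quantized latents q_t at the same time-steps
    distractors: (T, K, D) quantized latents sampled from other masked steps
    """
    candidates = torch.cat([targets.unsqueeze(1), distractors], dim=1)        # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1).expand_as(candidates),
                               candidates, dim=-1) / kappa                    # (T, K+1)
    labels = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
    return F.cross_entropy(sims, labels)  # the true target sits at index 0

def diversity_loss(avg_probs, eps=1e-7):
    """L_d: encourage equal use of the V entries in each of the G codebooks.

    avg_probs: (G, V) softmax over codebook entries, averaged over a batch.
    """
    G, V = avg_probs.shape
    entropy = -(avg_probs * torch.log(avg_probs + eps)).sum(dim=-1)           # H per codebook
    return -entropy.sum() / (G * V)  # minimizing this maximizes codebook entropy

def total_loss(context, targets, distractors, avg_probs, alpha=0.1):
    # L = L_m + alpha * L_d, with alpha = 0.1 as quoted in the Experiment Setup row
    return contrastive_loss(context, targets, distractors) + alpha * diversity_loss(avg_probs)
```

Distractors are sampled uniformly from other masked time-steps of the same utterance; that sampling is left outside the sketch.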
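
The Open Source Code row points to the fairseq repository for code and released checkpoints. As a hedged illustration of running one of the released models, the sketch below uses torchaudio's pipeline bundles, which repackage wav2vec 2.0 weights; this reflects an assumption about the reader's environment rather than the authors' own recipe, and the audio file name is a placeholder.

```python
import torch
import torchaudio

# Pretrained + fine-tuned BASE model (960 h of labeled Librispeech) via torchaudio.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# Placeholder path to a mono audio file; resample to the bundle's 16 kHz rate if needed.
waveform, sample_rate = torchaudio.load("sample.flac")
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    emissions, _ = model(waveform)  # (batch, frames, vocab) log-probabilities over characters

# Greedy CTC decoding: collapse repeats, drop blanks ('-'), map '|' to spaces
# (assumes the bundle's label set uses '-' as blank and '|' as word boundary).
labels = bundle.get_labels()
tokens = torch.unique_consecutive(emissions[0].argmax(dim=-1))
transcript = "".join(labels[int(t)] for t in tokens if labels[int(t)] != "-")
print(transcript.replace("|", " "))
```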
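
The Dataset Splits row lists the labeled Librispeech settings and the Libri-light limited-resource subsets. As a convenience sketch (an assumption about tooling, not the authors' pipeline), the standard Librispeech splits can be materialized with torchaudio; the Libri-light 10h/1h/10min subsets and TIMIT are distributed separately and are not covered by this loader.

```python
import torchaudio

root = "./data"  # placeholder download directory

# Standard Librispeech splits used for the 100 h labeled setting and evaluation.
train_100h = torchaudio.datasets.LIBRISPEECH(root, url="train-clean-100", download=True)
dev_clean  = torchaudio.datasets.LIBRISPEECH(root, url="dev-clean", download=True)
dev_other  = torchaudio.datasets.LIBRISPEECH(root, url="dev-other", download=True)
test_clean = torchaudio.datasets.LIBRISPEECH(root, url="test-clean", download=True)
test_other = torchaudio.datasets.LIBRISPEECH(root, url="test-other", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = train_100h[0]
print(sample_rate, transcript)
```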
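
The Experiment Setup row quotes the span-masking parameters (p = 0.065 starting indices, span length M = 10). A simplified NumPy sketch of that sampling follows; the fairseq implementation adds further constraints (e.g. guaranteeing a minimum number of spans), so treat this as an approximation.

```python
import numpy as np

def sample_mask(num_steps, p=0.065, mask_length=10, rng=None):
    """Sample a boolean mask over latent time-steps.

    Each step becomes a span start with probability p; the span covers the next
    `mask_length` steps, and overlapping spans simply merge.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros(num_steps, dtype=bool)
    for start in np.nonzero(rng.random(num_steps) < p)[0]:
        mask[start:start + mask_length] = True
    return mask

# The feature encoder emits roughly 50 frames per second (about 20 ms stride), so a
# 10-second crop gives ~500 steps; with p = 0.065 and M = 10 roughly half of the
# steps end up masked on average once overlaps are accounted for.
print(sample_mask(500).mean())
```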
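
The same row also quotes the optimization and temperature schedules: Adam with a learning-rate warm-up over the first 8% of updates to a peak of 5 × 10⁻⁴ (BASE), and a Gumbel-softmax temperature annealed from 2 by a factor of 0.999995 per update down to a floor of 0.5. A small sketch of both schedules follows; the total update count and the linear decay after warm-up are assumptions based on the paper, not values quoted in the table.

```python
def learning_rate(step, total_updates=400_000, peak_lr=5e-4, warmup_frac=0.08):
    """Linear warm-up to peak_lr over the first 8% of updates, then linear decay.

    total_updates is a placeholder; the decay shape after warm-up is assumed
    to be linear, following the paper's description.
    """
    warmup = int(total_updates * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total_updates - step) / (total_updates - warmup)

def gumbel_temperature(step, start=2.0, floor=0.5, decay=0.999995):
    """Multiplicative anneal of the Gumbel-softmax temperature with a floor."""
    return max(floor, start * decay ** step)

# e.g. after 100k updates the temperature has decayed to about 2 * 0.999995**100000 ≈ 1.21
print(gumbel_temperature(100_000))
```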