vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

Authors: Alexei Baevski, Steffen Schneider, Michael Auli

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Researcher Affiliation | Collaboration | Facebook AI Research, Menlo Park, CA, USA; University of Tübingen, Germany
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present. The paper describes algorithms mathematically and with diagrams.
Open Source Code | Yes | The code will be made available at http://github.com/pytorch/fairseq.
Open Datasets | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015).
Dataset Splits | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015)... We evaluate models on two benchmarks: TIMIT (Garofolo et al., 1993b) is a 5h dataset with phoneme labels, and Wall Street Journal (WSJ; Garofolo et al., 1993a) is an 81h dataset for speech recognition. For TIMIT, we apply the standard evaluation protocol... Tables 1, 2, 3 show results on nov93dev and dev PER.
Hardware Specification | No | All models are trained on 8 GPUs. We train on 128 GPUs with a batch size of 3072 tokens per GPU. No specific GPU or CPU model information is provided.
Software Dependencies | No | The paper mentions software such as fairseq, wav2letter, KenLM, and ffmpeg, along with their associated publications or developer names, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We train with the wav2vec context prediction loss (Equation 1) for 400k updates, predicting K = 8 steps into the future, and sample 10 negatives from the same audio example. Training is warmed up for 500 steps where the learning rate is increased from 1 × 10^-7 to 5 × 10^-3, and then annealed to 1 × 10^-6 using a cosine schedule (Loshchilov & Hutter, 2016). The batch size is 10, and we crop a random section of 150k frames for each example. For Gumbel-Softmax models, we use G = 2 groups and V = 320 latents per group... The temperature τ is linearly annealed from 2 to 0.5. For k-means models, we found γ = 0.25. BERT base models have 12 layers, model dimension 768, inner dimension (FFN) 3072, and 12 attention heads. The learning rate is warmed up over the first 10,000 updates to a peak value of 1 × 10^-5, and then linearly decayed over a total of 250k updates. We train on 128 GPUs with a batch size of 3072 tokens per GPU.
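The quoted Gumbel-Softmax settings (G = 2 groups, V = 320 latents per group, temperature annealed linearly from 2 to 0.5) can be illustrated with a minimal sketch. This is not the authors' fairseq implementation; the feature and codebook dimensions (`input_dim`, `embed_dim`) and the helper names are assumptions made purely for illustration.

```python
# Minimal sketch of a grouped Gumbel-Softmax quantizer with the quoted settings:
# G = 2 groups, V = 320 latents per group, temperature annealed from 2.0 to 0.5.
# Illustrative only; input_dim / embed_dim are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelQuantizer(nn.Module):
    def __init__(self, input_dim=512, groups=2, vocab=320, embed_dim=512):
        super().__init__()
        self.groups, self.vocab = groups, vocab
        # one logit per (group, latent) pair
        self.to_logits = nn.Linear(input_dim, groups * vocab)
        # codebook: each group owns `vocab` vectors of size embed_dim // groups
        self.codebook = nn.Parameter(torch.randn(groups, vocab, embed_dim // groups))

    def forward(self, x, tau):
        # x: (batch, time, input_dim) dense features from the encoder
        b, t, _ = x.shape
        logits = self.to_logits(x).view(b, t, self.groups, self.vocab)
        # hard Gumbel-Softmax: one-hot codes in the forward pass,
        # differentiable soft probabilities in the backward pass
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # select one codeword per group and concatenate the groups
        quantized = torch.einsum("btgv,gvd->btgd", onehot, self.codebook)
        return quantized.reshape(b, t, -1)


def gumbel_temperature(step, total_steps, start=2.0, end=0.5):
    """Linear annealing of the Gumbel-Softmax temperature, as quoted above."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)


# usage sketch: quantize a batch of dense features at the initial temperature
quantizer = GumbelQuantizer()
features = torch.randn(4, 100, 512)  # (batch, time, feature dim)
codes = quantizer(features, tau=gumbel_temperature(step=0, total_steps=400_000))
print(codes.shape)  # torch.Size([4, 100, 512])
```

The `total_steps=400_000` in the usage line mirrors the 400k pre-training updates quoted above; how exactly the annealing schedule maps onto those updates is not specified in the excerpt.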