vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Authors: Alexei Baevski, Steffen Schneider, Michael Auli
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition. |
| Researcher Affiliation | Collaboration | Facebook AI Research, Menlo Park, CA, USA; University of Tübingen, Germany |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present. The paper describes algorithms mathematically and with diagrams. |
| Open Source Code | Yes | The code will be made available at http://github.com/pytorch/fairseq. |
| Open Datasets | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015) |
| Dataset Splits | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015)... We evaluate models on two benchmarks: TIMIT (Garofolo et al., 1993b) is a 5h dataset with phoneme labels and Wall Street Journal (WSJ; Garofolo et al., 1993a) is an 81h dataset for speech recognition. For TIMIT, we apply the standard evaluation protocol... Tables 1, 2, 3 show results on nov93dev and dev PER. |
| Hardware Specification | No | All models are trained on 8 GPUs. We train on 128 GPUs with a batch size of 3072 tokens per GPU. No specific GPU or CPU model information is provided. |
| Software Dependencies | No | The paper mentions software such as fairseq, wav2letter, KenLM, and ffmpeg, along with their associated publications or developer names, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We train with the wav2vec context prediction loss (Equation 1) for 400k updates, predicting K = 8 steps into the future and sampling 10 negatives from the same audio example. Training is warmed up for 500 steps, during which the learning rate is increased from 1×10⁻⁷ to 5×10⁻³, and then annealed to 1×10⁻⁶ using a cosine schedule (Loshchilov & Hutter, 2016). The batch size is 10, and we crop a random section of 150k frames for each example. For Gumbel-Softmax models, we use G = 2 groups and V = 320 latents per group... The temperature τ is linearly annealed from 2 to 0.5. For k-means models, we found γ = 0.25. BERT base models have 12 layers, model dimension 768, inner dimension (FFN) 3072 and 12 attention heads. The learning rate is warmed up over the first 10,000 updates to a peak value of 1×10⁻⁵, and then linearly decayed over a total of 250k updates. We train on 128 GPUs with a batch size of 3072 tokens per GPU. |
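
For readers trying to reproduce the quoted setup, the sketch below illustrates two pieces of it: a grouped Gumbel-Softmax quantizer with G = 2 groups and V = 320 latents per group, and the warmup-then-cosine learning-rate schedule (1×10⁻⁷ → 5×10⁻³ over 500 steps, annealed to 1×10⁻⁶) together with the linear temperature annealing from 2.0 to 0.5. This is a minimal PyTorch sketch, not the released fairseq implementation; class and function names (`GumbelGroupQuantizer`, `learning_rate`, `temperature`) and the feature dimension of 512 are illustrative assumptions.

```python
# Minimal sketch of the grouped Gumbel-Softmax quantizer and training schedules
# described in the experiment setup above. Not the fairseq implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelGroupQuantizer(nn.Module):
    """Quantize each time step into G groups, each selecting 1 of V codewords."""

    def __init__(self, dim=512, groups=2, vocab=320):
        super().__init__()
        self.groups, self.vocab = groups, vocab
        # project features to G * V logits (one categorical distribution per group)
        self.to_logits = nn.Linear(dim, groups * vocab)
        # one codebook of V entries per group; group outputs are concatenated
        self.codebook = nn.Parameter(torch.randn(groups, vocab, dim // groups))

    def forward(self, z, tau):
        B, T, _ = z.shape
        logits = self.to_logits(z).view(B, T, self.groups, self.vocab)
        if self.training:
            # straight-through Gumbel-Softmax: hard one-hot forward, soft gradients
            onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        else:
            onehot = F.one_hot(logits.argmax(-1), self.vocab).float()
        # look up the selected codeword in each group and concatenate the groups
        q = torch.einsum("btgv,gvd->btgd", onehot, self.codebook)
        return q.reshape(B, T, -1), logits.argmax(-1)  # quantized features, indices


def learning_rate(step, warmup=500, total=400_000,
                  lr_init=1e-7, lr_peak=5e-3, lr_final=1e-6):
    """Linear warmup to the peak LR, then cosine annealing to the final LR."""
    if step < warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return lr_final + 0.5 * (lr_peak - lr_final) * (1 + math.cos(math.pi * progress))


def temperature(step, total=400_000, tau_start=2.0, tau_end=0.5):
    """Linear annealing of the Gumbel-Softmax temperature."""
    return tau_start + (tau_end - tau_start) * min(step / total, 1.0)
```

A typical use inside a training loop would be `q, idx = quantizer(features, tau=temperature(step))`, with the optimizer's learning rate set to `learning_rate(step)` each update; the discrete indices `idx` are what a downstream BERT model would consume as tokens.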