vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

Authors: Alexei Baevski, Steffen Schneider, Michael Auli

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Researcher Affiliation | Collaboration | Facebook AI Research, Menlo Park, CA, USA; University of Tübingen, Germany
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks are present. The paper describes algorithms mathematically and with diagrams.
Open Source Code | Yes | The code will be made available at http://github.com/pytorch/fairseq.
Open Datasets | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015).
Dataset Splits | Yes | We generally pre-train vq-wav2vec and BERT on the full 960h of Librispeech (Panayotov et al., 2015)... We evaluate models on two benchmarks: TIMIT (Garofolo et al., 1993b) is a 5h dataset with phoneme labels, and Wall Street Journal (WSJ; Garofolo et al., 1993a) is an 81h dataset for speech recognition. For TIMIT, we apply the standard evaluation protocol... Tables 1, 2, 3 show results on nov93dev and dev PER.
Hardware Specification | No | All models are trained on 8 GPUs. We train on 128 GPUs with a batch size of 3072 tokens per GPU. No specific GPU or CPU model information is provided.
Software Dependencies | No | The paper mentions software such as fairseq, wav2letter, KenLM, and ffmpeg, along with their associated publications or developer names, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We train with the wav2vec context prediction loss (Equation 1) for 400k updates, predicting K = 8 steps into the future, and sample 10 negatives from the same audio example. Training is warmed up for 500 steps where the learning rate is increased from 1 × 10^-7 to 5 × 10^-3, and then annealed to 1 × 10^-6 using a cosine schedule (Loshchilov & Hutter, 2016). The batch size is 10, and we crop a random section of 150k frames for each example. For Gumbel-Softmax models, we use G = 2 groups and V = 320 latents per group... The temperature τ is linearly annealed from 2 to 0.5. For k-means models, we found γ = 0.25. BERT base models have 12 layers, model dimension 768, inner dimension (FFN) 3072, and 12 attention heads. The learning rate is warmed up over the first 10,000 updates to a peak value of 1 × 10^-5, and then linearly decayed over a total of 250k updates. We train on 128 GPUs with a batch size of 3072 tokens per GPU.
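The quoted Gumbel-Softmax settings (G = 2 groups, V = 320 latents per group, temperature annealed linearly from 2 to 0.5) can be illustrated with a minimal sketch. This is not the authors' fairseq implementation; the feature and codebook dimensions (`input_dim`, `embed_dim`) and the helper names are assumptions made purely for illustration.

```python
# Minimal sketch of a grouped Gumbel-Softmax quantizer with the quoted settings:
# G = 2 groups, V = 320 latents per group, temperature annealed from 2.0 to 0.5.
# Illustrative only; input_dim / embed_dim are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelQuantizer(nn.Module):
    def __init__(self, input_dim=512, groups=2, vocab=320, embed_dim=512):
        super().__init__()
        self.groups, self.vocab = groups, vocab
        # one logit per (group, latent) pair
        self.to_logits = nn.Linear(input_dim, groups * vocab)
        # codebook: each group owns `vocab` vectors of size embed_dim // groups
        self.codebook = nn.Parameter(torch.randn(groups, vocab, embed_dim // groups))

    def forward(self, x, tau):
        # x: (batch, time, input_dim) dense features from the encoder
        b, t, _ = x.shape
        logits = self.to_logits(x).view(b, t, self.groups, self.vocab)
        # hard Gumbel-Softmax: one-hot codes in the forward pass,
        # differentiable soft probabilities in the backward pass
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # select one codeword per group and concatenate the groups
        quantized = torch.einsum("btgv,gvd->btgd", onehot, self.codebook)
        return quantized.reshape(b, t, -1)


def gumbel_temperature(step, total_steps, start=2.0, end=0.5):
    """Linear annealing of the Gumbel-Softmax temperature, as quoted above."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)


# usage sketch: quantize a batch of dense features at the initial temperature
quantizer = GumbelQuantizer()
features = torch.randn(4, 100, 512)  # (batch, time, feature dim)
codes = quantizer(features, tau=gumbel_temperature(step=0, total_steps=400_000))
print(codes.shape)  # torch.Size([4, 100, 512])
```

The `total_steps=400_000` in the usage line mirrors the 400k pre-training updates quoted above; how exactly the annealing schedule maps onto those updates is not specified in the excerpt.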