Word-Level Speech Recognition With a Letter to Word Encoder

Authors: Ronan Collobert, Awni Hannun, Gabriel Synnaeve

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We validate our method on two commonly used models for end-to-end speech recognition. The first is a Connectionist Temporal Classification (CTC) model (Graves et al., 2006) and the second is a sequence-to-sequence model with attention (seq2seq) (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014). We show competitive performance with both approaches and demonstrate the advantage of predicting words directly, especially when decoding without the use of an external language model. We perform experiments on the LibriSpeech corpus: 960 hours of speech collected from open domain audio books (Panayotov et al., 2015). (See the CTC loss sketch after this table.)
Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Ronan Collobert <locronan@fb.com>, Awni Hannun <awni@fb.com>, Gabriel Synnaeve <gab@fb.com>.
Pseudocode | No | The paper includes architectural diagrams (Figures 1, 2, and 3) but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | We use the open source WAV2LETTER++ toolkit (Pratap et al., 2018) to perform our experiments.
Open Datasets | Yes | We perform experiments on the LibriSpeech corpus: 960 hours of speech collected from open domain audio books (Panayotov et al., 2015).
Dataset Splits | Yes | All hyper-parameters are tuned according to the word error rates on the standard validation sets. Final test set performance is reported for both the CLEAN and OTHER settings (the latter being a subset of the data with noisier utterances). We compare the validation WER on both the CLEAN and OTHER conditions. (See the split-loading sketch after this table.)
Hardware Specification | No | We train with Stochastic Gradient Descent, with a mini-batch size of 128 samples split evenly across eight GPUs.
Software Dependencies | No | We use the open source WAV2LETTER++ toolkit (Pratap et al., 2018) to perform our experiments.
Experiment Setup | Yes | We train with Stochastic Gradient Descent, with a mini-batch size of 128 samples split evenly across eight GPUs. ... We found that the norm of the output of the acoustic and word embedding models was slowly growing in magnitude, causing numerical instability and ultimately leading to divergence. We circumvent the issue by constraining the norm of the embeddings to lie in an L2-ball. Keeping ||f_t^am||_2 ≤ 5 and ||f_t^wd||_2 ≤ 5 for both the acoustic and word model embeddings stabilized training. (See the norm-constraint sketch after this table.)
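The Research Type row cites the paper's two decoder families: CTC (Graves et al., 2006) and attention-based seq2seq. As a concrete point of reference only, here is a minimal sketch of evaluating a CTC loss with PyTorch's built-in nn.CTCLoss; the shapes, vocabulary size, and random inputs are assumptions for illustration, and the paper's own experiments run in the wav2letter++ toolkit rather than PyTorch.

```python
# Minimal CTC loss sketch (illustrative only; shapes and token inventory are assumed).
import torch
import torch.nn as nn

T, N, C = 200, 4, 32   # frames, batch size, output tokens (blank = index 0)
S = 20                 # target transcript length

log_probs = torch.randn(T, N, C).log_softmax(dim=-1)     # stand-in acoustic model output
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # stand-in word/sub-word targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```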
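The Open Datasets and Dataset Splits rows point to LibriSpeech with its standard CLEAN and OTHER validation and test conditions. A minimal sketch of fetching those splits with torchaudio follows, assuming torchaudio is installed and "./data" is writable; the paper's 960-hour training set is the union of train-clean-100, train-clean-360, and train-other-500, which are omitted here to keep the download small.

```python
# Minimal LibriSpeech split-loading sketch (assumes torchaudio; the root path is illustrative).
import torchaudio

root = "./data"  # assumed local directory
splits = ["dev-clean", "dev-other", "test-clean", "test-other"]
datasets = {s: torchaudio.datasets.LIBRISPEECH(root, url=s, download=True) for s in splits}

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = datasets["dev-clean"][0]
print(sample_rate, transcript)
```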
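The Experiment Setup row describes SGD with a global mini-batch of 128 split over eight GPUs and an L2-ball constraint of radius 5 on the acoustic and word embeddings. The paper does not give the projection code, so the sketch below is one plausible implementation under stated assumptions: PyTorch tensors with the embedding dimension last, and illustrative names f_am and f_wd.

```python
# Sketch of projecting embeddings onto an L2-ball of radius 5 (names and shapes are assumed).
import torch

def project_to_l2_ball(x: torch.Tensor, radius: float = 5.0) -> torch.Tensor:
    """Rescale each vector along the last dim so its L2 norm is at most `radius`."""
    norms = x.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = (radius / norms).clamp(max=1.0)  # shrink only vectors outside the ball
    return x * scale

per_gpu_batch = 128 // 8                            # 16 samples per GPU, per the quoted setup
f_am = torch.randn(per_gpu_batch, 200, 512) * 3.0   # stand-in acoustic embeddings (B, T, D)
f_wd = torch.randn(per_gpu_batch, 20, 512) * 3.0    # stand-in word embeddings (B, S, D)
f_am, f_wd = project_to_l2_ball(f_am), project_to_l2_ball(f_wd)
print(f_am.norm(dim=-1).max().item())               # at most 5, up to floating-point error
```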