Word-Level Speech Recognition With a Letter to Word Encoder

Authors: Ronan Collobert, Awni Hannun, Gabriel Synnaeve

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We validate our method on two commonly used models for end-to-end speech recognition. The first is a Connectionist Temporal Classification (CTC) model (Graves et al., 2006) and the second is a sequence-to-sequence model with attention (seq2seq) (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014). We show competitive performance with both approaches and demonstrate the advantage of predicting words directly, especially when decoding without the use of an external language model. We perform experiments on the LibriSpeech corpus: 960 hours of speech collected from open domain audio books (Panayotov et al., 2015). (See the CTC loss sketch after this table.)
Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Ronan Collobert <locronan@fb.com>, Awni Hannun <awni@fb.com>, Gabriel Synnaeve <gab@fb.com>.
Pseudocode | No | The paper includes architectural diagrams (Figures 1, 2, and 3) but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | We use the open source WAV2LETTER++ toolkit (Pratap et al., 2018) to perform our experiments.
Open Datasets | Yes | We perform experiments on the LibriSpeech corpus: 960 hours of speech collected from open domain audio books (Panayotov et al., 2015).
Dataset Splits | Yes | All hyper-parameters are tuned according to the word error rates on the standard validation sets. Final test set performance is reported for both the CLEAN and OTHER settings (the latter being a subset of the data with noisier utterances). We compare the validation WER on both the CLEAN and OTHER conditions. (See the split-loading sketch after this table.)
Hardware Specification | No | We train with Stochastic Gradient Descent, with a mini-batch size of 128 samples split evenly across eight GPUs.
Software Dependencies | No | We use the open source WAV2LETTER++ toolkit (Pratap et al., 2018) to perform our experiments.
Experiment Setup | Yes | We train with Stochastic Gradient Descent, with a mini-batch size of 128 samples split evenly across eight GPUs. ... We found that the norm of the output of the acoustic and word embedding models was slowly growing in magnitude, causing numerical instability and ultimately leading to divergence. We circumvent the issue by constraining the norm of the embeddings to lie in an L2-ball. Keeping ||f_t^am||_2 ≤ 5 and ||f_t^wd||_2 ≤ 5 for both the acoustic and word model embeddings stabilized training. (See the norm-constraint sketch after this table.)
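The Research Type row cites the paper's two decoder families: CTC (Graves et al., 2006) and attention-based seq2seq. As a concrete point of reference only, here is a minimal sketch of evaluating a CTC loss with PyTorch's built-in nn.CTCLoss; the shapes, vocabulary size, and random inputs are assumptions for illustration, and the paper's own experiments run in the wav2letter++ toolkit rather than PyTorch.

```python
# Minimal CTC loss sketch (illustrative only; shapes and token inventory are assumed).
import torch
import torch.nn as nn

T, N, C = 200, 4, 32   # frames, batch size, output tokens (blank = index 0)
S = 20                 # target transcript length

log_probs = torch.randn(T, N, C).log_softmax(dim=-1)     # stand-in acoustic model output
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # stand-in word/sub-word targets
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```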
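The Open Datasets and Dataset Splits rows point to LibriSpeech with its standard CLEAN and OTHER validation and test conditions. A minimal sketch of fetching those splits with torchaudio follows, assuming torchaudio is installed and "./data" is writable; the paper's 960-hour training set is the union of train-clean-100, train-clean-360, and train-other-500, which are omitted here to keep the download small.

```python
# Minimal LibriSpeech split-loading sketch (assumes torchaudio; the root path is illustrative).
import torchaudio

root = "./data"  # assumed local directory
splits = ["dev-clean", "dev-other", "test-clean", "test-other"]
datasets = {s: torchaudio.datasets.LIBRISPEECH(root, url=s, download=True) for s in splits}

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = datasets["dev-clean"][0]
print(sample_rate, transcript)
```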
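The Experiment Setup row describes SGD with a global mini-batch of 128 split over eight GPUs and an L2-ball constraint of radius 5 on the acoustic and word embeddings. The paper does not give the projection code, so the sketch below is one plausible implementation under stated assumptions: PyTorch tensors with the embedding dimension last, and illustrative names f_am and f_wd.

```python
# Sketch of projecting embeddings onto an L2-ball of radius 5 (names and shapes are assumed).
import torch

def project_to_l2_ball(x: torch.Tensor, radius: float = 5.0) -> torch.Tensor:
    """Rescale each vector along the last dim so its L2 norm is at most `radius`."""
    norms = x.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = (radius / norms).clamp(max=1.0)  # shrink only vectors outside the ball
    return x * scale

per_gpu_batch = 128 // 8                            # 16 samples per GPU, per the quoted setup
f_am = torch.randn(per_gpu_batch, 200, 512) * 3.0   # stand-in acoustic embeddings (B, T, D)
f_wd = torch.randn(per_gpu_batch, 20, 512) * 3.0    # stand-in word embeddings (B, S, D)
f_am, f_wd = project_to_l2_ball(f_am), project_to_l2_ball(f_wd)
print(f_am.norm(dim=-1).max().item())               # at most 5, up to floating-point error
```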