Towards End-To-End Speech Recognition with Recurrent Neural Networks
Authors: Alex Graves, Navdeep Jaitly
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the Wall Street Journal speech corpus demonstrate that the system is able to recognise words to reasonable accuracy, even in the absence of a language model or dictionary, and that when combined with a language model it performs comparably to a state-of-the-art pipeline. |
| Researcher Affiliation | Collaboration | Alex Graves (graves@cs.toronto.edu), Google DeepMind, London, United Kingdom; Navdeep Jaitly (ndjaitly@cs.toronto.edu), Department of Computer Science, University of Toronto, Canada |
| Pseudocode | Yes | Algorithm 1 CTC Beam Search (a simplified Python sketch follows the table). |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | The experiments were carried out on the Wall Street Journal (WSJ) corpus (available as LDC corpus LDC93S6B and LDC94S13B). |
| Dataset Splits | Yes | The RNN was trained on both the 14 hour subset train-si84 and the full 81 hour set, with the test-dev93 development set used for validation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions 'matplotlib python toolkit' and 'Kaldi recipe s5' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The network had five levels of bidirectional LSTM hidden layers, with 500 cells in each layer, giving a total of 26.5M weights. It was trained using stochastic gradient descent with one weight update per utterance, a learning rate of 10⁻⁴ and a momentum of 0.9. The baseline DNN was trained with stochastic gradient descent, starting with a learning rate of 0.1 and a momentum of 0.9 (a configuration sketch follows the table). |
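
The beam search referenced in the Pseudocode row can be made concrete. The sketch below is a minimal prefix-style CTC beam search over per-frame character distributions; it omits the dictionary constraints and language-model rescoring that Algorithm 1 in the paper supports, and the function and variable names are ours, not the authors'.

```python
import numpy as np

def ctc_beam_search(probs, alphabet, beam_width=8, blank=0):
    """Simplified CTC prefix beam search (no dictionary or language model).

    probs: (T, V) array of per-frame label probabilities, where column
    `blank` holds the CTC blank symbol.
    alphabet: string of length V mapping label ids to characters.
    Returns the most probable collapsed label sequence as a string.
    """
    # Each beam entry maps a prefix (tuple of label ids) to a pair
    # (p_blank, p_non_blank): the probability mass of all paths that
    # collapse to the prefix and end in a blank vs. end in its last label.
    beams = {(): (1.0, 0.0)}
    for t in range(probs.shape[0]):
        new_beams = {}
        for prefix, (p_b, p_nb) in beams.items():
            # Emit a blank: the prefix is unchanged.
            b, nb = new_beams.get(prefix, (0.0, 0.0))
            new_beams[prefix] = (b + (p_b + p_nb) * probs[t, blank], nb)
            for c in range(probs.shape[1]):
                if c == blank:
                    continue
                p_c = probs[t, c]
                ext = prefix + (c,)
                if prefix and prefix[-1] == c:
                    # Repeated label: without an intervening blank it
                    # collapses onto the same prefix; after a blank it
                    # genuinely extends the prefix.
                    b, nb = new_beams.get(prefix, (0.0, 0.0))
                    new_beams[prefix] = (b, nb + p_nb * p_c)
                    b, nb = new_beams.get(ext, (0.0, 0.0))
                    new_beams[ext] = (b, nb + p_b * p_c)
                else:
                    b, nb = new_beams.get(ext, (0.0, 0.0))
                    new_beams[ext] = (b, nb + (p_b + p_nb) * p_c)
        # Prune to the beam_width most probable prefixes.
        beams = dict(sorted(new_beams.items(), key=lambda kv: sum(kv[1]),
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[c] for c in best)

# Toy usage: three frames over the vocabulary {blank, 'a', 'b'}.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.6, 0.2, 0.2]])
print(ctc_beam_search(probs, alphabet="-ab"))  # prints "a"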
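
Similarly, the Experiment Setup row maps onto a small amount of modern framework code. The following PyTorch sketch is an assumption-laden reconstruction, not the authors' implementation: the paper predates PyTorch, and the input feature and output label dimensions below are illustrative placeholders. It shows the five-layer bidirectional LSTM with 500 cells per direction and the quoted SGD settings.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only: the paper feeds spectrogram frames to the
# network and decodes to characters, but the exact sizes below are our
# placeholders, not values confirmed by the paper.
NUM_FEATURES = 123  # input features per frame (assumption)
NUM_LABELS = 30     # characters plus the CTC blank (assumption)

class DeepBiLSTM(nn.Module):
    """Five stacked bidirectional LSTM layers, 500 cells per direction,
    per the Experiment Setup row above."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(input_size=NUM_FEATURES, hidden_size=500,
                           num_layers=5, bidirectional=True,
                           batch_first=True)
        self.out = nn.Linear(2 * 500, NUM_LABELS)

    def forward(self, x):          # x: (batch, time, NUM_FEATURES)
        h, _ = self.rnn(x)
        return self.out(h)         # per-frame logits for CTC training

model = DeepBiLSTM()
# SGD with one update per utterance, learning rate 1e-4, momentum 0.9,
# as quoted from the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
ctc_loss = nn.CTCLoss(blank=0)
```

With an input dimension in the low hundreds, this stack lands near the 26.5M weights the paper reports, which serves as a rough sanity check on the architecture reading, though the paper's exact layer wiring and LSTM variant (it uses peephole connections) may differ.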