Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices
Authors: Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, Wonyong Sung
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present real-time speech recognition on smartphones or embedded systems by employing recurrent neural network (RNN)-based acoustic models, RNN-based language models, and beam-search decoding. The experimental results, including the execution time analysis, are shown in Section 4. Table 1 shows the CER and WER performance of the RNN models trained with the WSJ SI-284 training set. |
| Researcher Affiliation | Academia | Jinhwan Park, Seoul National University, bnoo@snu.ac.kr; Yoonho Boo, Seoul National University, dnsgh@snu.ac.kr; Iksoo Choi, Seoul National University, akacis@snu.ac.kr; Sungho Shin, Seoul National University, ssh9919@snu.ac.kr; Wonyong Sung, Seoul National University, wysung@snu.ac.kr |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides equations and describes algorithms in text, for example in Appendix C which is titled "Details of Decoding Algorithm" but does not show pseudocode. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no repository link or explicit code release statement. |
| Open Datasets | Yes | We used the Wall Street Journal (WSJ) SI-284 training set (81 hours) for the fast evaluation of AMs. We also trained our system using a larger dataset, the Librispeech corpus [33]. [33] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015. |
| Dataset Splits | Yes | We randomly selected 5% of the WSJ LM training text as the validation set and another 5% as the test set; the remaining 90% of the text was used to train the RNN LM. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | The implementation operates in real-time on the ARM Cortex-A57 based embedded system without GPU support. The ARM CPU has 80 KB L1 data cache and 2,048 KB L2 cache. |
| Software Dependencies | No | The OpenBLAS library [34] is used for the optimization of computation. For the 8-bit implementation, the gemmlowp library [35] is employed. The paper names OpenBLAS and gemmlowp but gives no version numbers for these software components. (A sketch of the 8-bit quantized arithmetic follows the table.) |
| Experiment Setup | Yes | The width of the 1-D convolution is set to 15, which appears to be the optimum in our experiments. We applied batch normalization [28] to the first two convolutional layers and variational dropout [29] to every output of the recurrent layer for regularization. The Adam optimizer [30] was applied for training. We used an initial learning rate of 3e-4, and the learning rate was halved if the validation error was not lowered for 8 consecutive epochs. Gradient clipping with a maximum norm of 4.0 was applied. (A training-setup sketch follows the table.) |
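
The Dataset Splits row describes a simple 90/5/5 cut of the WSJ LM training text. A minimal sketch of such a split, assuming a plain-text corpus file and a fixed shuffle seed (the file name and seed are assumptions; the paper specifies only the percentages):

```python
# Hedged sketch of the quoted 90/5/5 LM text split; "wsj_lm_train.txt"
# and the seed are hypothetical, only the proportions come from the paper.
import random

with open("wsj_lm_train.txt") as f:
    lines = f.readlines()

random.seed(0)
random.shuffle(lines)

n = len(lines)
valid = lines[: n // 20]            # 5% for the validation set
test  = lines[n // 20 : n // 10]    # another 5% for the test set
train = lines[n // 10 :]            # remaining 90% trains the RNN LM
```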
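The Software Dependencies row notes that the 8-bit implementation relies on gemmlowp. The snippet below is not the gemmlowp API; it is a NumPy sketch of the affine (scale/zero-point) 8-bit quantized matrix multiplication that such libraries implement, shown only to make the arithmetic concrete:

```python
# Illustrative sketch of 8-bit quantized matmul of the kind gemmlowp
# provides: quantize inputs to uint8, accumulate in int32, dequantize.
import numpy as np

def quantize(x):
    """Map a float array onto uint8 with an affine scale/zero-point."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def quantized_matmul(a, b):
    qa, sa, za = quantize(a)
    qb, sb, zb = quantize(b)
    # Subtract zero points and accumulate in int32, as 8-bit GEMM kernels do.
    acc = (qa.astype(np.int32) - za) @ (qb.astype(np.int32) - zb)
    return acc * (sa * sb)  # dequantize back to float

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(quantized_matmul(a, b) - a @ b)))  # small quantization error
```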
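The Experiment Setup row quotes concrete optimization settings. A hedged PyTorch sketch wiring them together, with a placeholder LSTM standing in for the paper's actual acoustic model (the model shape, feature size, and loss function are assumptions; the optimizer, learning-rate schedule, and clipping norm come from the quoted text):

```python
# Hedged sketch, not the authors' code: Adam at lr 3e-4, LR halved after
# 8 epochs without validation improvement, gradient-norm clipping at 4.0.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=40, hidden_size=256, num_layers=2)  # placeholder AM
params = list(model.parameters())

optimizer = torch.optim.Adam(params, lr=3e-4)  # "initial learning rate of 3e-4"
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=8  # halve LR after 8 stagnant epochs
)

def train_step(inputs, targets, loss_fn):
    optimizer.zero_grad()
    outputs, _ = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=4.0)  # "maximum norm of 4.0"
    optimizer.step()
    return loss.item()

# After each epoch, step the scheduler on the validation error:
#     scheduler.step(validation_loss)
```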