Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices

Authors: Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, Wonyong Sung

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present real-time speech recognition on smartphones or embedded systems by employing recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The experimental results including the execution time analysis are shown in Section 4. Table 1 shows the CER and WER performance of the RNN models trained with the WSJ SI-284 training set.
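The excerpt above describes combining per-frame acoustic scores with an RNN LM inside a beam-search decoder. The sketch below is a minimal, simplified illustration of that idea, not the authors' implementation: `am_scores` and `lm_score` are hypothetical placeholders, the beam width and LM weight are arbitrary, and CTC-specific blank/prefix handling is omitted.

```python
import math
from typing import Callable, List, Tuple

def beam_search(
    am_scores: List[dict],                   # per-frame {char: log-prob} from the acoustic model
    lm_score: Callable[[str, str], float],   # hypothetical LM: log P(next_char | prefix)
    beam_width: int = 8,
    lm_weight: float = 0.5,
) -> str:
    """Beam search combining acoustic and LM log-probabilities.

    Simplified sketch: no CTC blank handling, no length normalization.
    """
    beams: List[Tuple[str, float]] = [("", 0.0)]   # (prefix, total log-score)
    for frame in am_scores:
        candidates = []
        for prefix, score in beams:
            for ch, am_lp in frame.items():
                total = score + am_lp + lm_weight * lm_score(prefix, ch)
                candidates.append((prefix + ch, total))
        # Keep only the best `beam_width` hypotheses.
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# Toy usage with a uniform "LM" and two frames of fake acoustic scores.
if __name__ == "__main__":
    frames = [{"a": math.log(0.7), "b": math.log(0.3)},
              {"a": math.log(0.4), "b": math.log(0.6)}]
    print(beam_search(frames, lm_score=lambda prefix, ch: 0.0))
```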
Researcher Affiliation | Academia | Jinhwan Park, Seoul National University, bnoo@snu.ac.kr; Yoonho Boo, Seoul National University, dnsgh@snu.ac.kr; Iksoo Choi, Seoul National University, akacis@snu.ac.kr; Sungho Shin, Seoul National University, ssh9919@snu.ac.kr; Wonyong Sung, Seoul National University, wysung@snu.ac.kr
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides equations and describes algorithms in text, for example in Appendix C, which is titled "Details of Decoding Algorithm" but does not show pseudocode.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no repository link or explicit code release statement.
Open Datasets | Yes | We used Wall Street Journal (WSJ) SI-284 training set (81 hours) for the fast evaluation of AMs. We also trained our system using a larger dataset, Librispeech Corpus [33]. [33] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
Dataset Splits | Yes | We randomly selected 5% of WSJ LM training text to the valid set, and another 5% to the test set. The remaining 90% of the text is used for training RNN LM.
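A minimal sketch of how such a 5%/5%/90% random text split could be reproduced. The proportions follow the quoted text; the shuffling procedure, seed, and function name below are assumptions, since the paper does not specify them.

```python
import random

def split_lm_text(lines, valid_frac=0.05, test_frac=0.05, seed=0):
    """Randomly split LM training text into train/valid/test portions.

    The 5%/5%/90% proportions follow the paper; the shuffle and seed
    are assumptions, as the paper does not describe them.
    """
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test
```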
Hardware Specification | Yes | The implementation operates in real-time on the ARM Cortex-A57 based embedded system without GPU support. The ARM CPU has 80 KB L1 data cache and 2,048 KB L2 cache.
Software Dependencies | No | Open BLAS library [34] is used for the optimization of computation. For 8-bit implementation, gemmlowp library [35] is employed. The paper names the OpenBLAS and gemmlowp libraries but does not provide version numbers for either.
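For context on the 8-bit path, the NumPy sketch below illustrates the quantize → integer GEMM → dequantize flow that an 8-bit GEMM library provides conceptually. gemmlowp itself is a C++ library with a different interface (unsigned operands and zero-point offsets), so this is only a conceptual stand-in, not the library's API.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float matrix to int8."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply two float matrices via int8 operands with 32-bit accumulation,
    then dequantize. Mimics the idea behind an 8-bit GEMM, not gemmlowp's API."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # integer accumulator
    return acc.astype(np.float32) * (sa * sb)

if __name__ == "__main__":
    a = np.random.randn(4, 8).astype(np.float32)
    b = np.random.randn(8, 3).astype(np.float32)
    print(np.max(np.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```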
Experiment Setup | Yes | The width of 1-D convolution is set to 15, which seems to be the optimum number at our experiments. We applied batch normalization [28] to the first two convolutional layers and variational dropout [29] to every output of the recurrent layer for regularization. Adam optimizer [30] was applied for training. We used an initial learning rate of 3e-4, and the learning rate was reduced to half if the validation error was not lowered for consecutive 8 epochs. Gradient clipping with a maximum norm of 4.0 was applied.
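The optimizer, learning-rate schedule, and gradient clipping quoted above map onto standard framework calls. The following PyTorch sketch is a hedged illustration of those three settings only: the paper does not state which framework was used, `model` is a placeholder rather than the authors' conv/recurrent network, and the batch-norm and variational-dropout details are not shown here.

```python
import torch

# Placeholder model; the paper's actual acoustic model combines
# convolutional layers (with batch norm) and recurrent layers.
model = torch.nn.GRU(input_size=40, hidden_size=256, num_layers=2)

# Adam with an initial learning rate of 3e-4, as in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Halve the learning rate when the validation error has not improved
# for 8 consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=8)

def training_step(batch, targets, loss_fn):
    """One optimization step with gradient clipping at max norm 4.0."""
    optimizer.zero_grad()
    outputs, _ = model(batch)
    loss = loss_fn(outputs, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=4.0)
    optimizer.step()
    return loss.item()

# After each epoch, step the scheduler on the validation metric:
# scheduler.step(validation_error)
```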