Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Authors: Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, Zhenyao Zhu
ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. |
| Researcher Affiliation | Industry | The authors (listed above) are affiliated with Baidu Silicon Valley AI Lab, 1195 Bordeaux Avenue, Sunnyvale CA 94086 USA, and Baidu Speech Technology Group, No. 10 Xibeiwang East Street, Ke Ji Yuan, Haidian District, Beijing 100193 CHINA. |
| Pseudocode | No | The paper describes the model architecture and mathematical equations but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Details of our CTC implementation will be made available along with open source code. |
| Open Datasets | Yes | We benchmark our system on two test sets from the Wall Street Journal (WSJ) corpus of read news articles and the LibriSpeech corpus constructed from audio books (Panayotov et al., 2015). We also tested our system for robustness to common accents using the VoxForge (http://www.voxforge.org) dataset. Finally, we tested our performance on noisy speech using the test sets from the recently completed third CHiME challenge (Barker et al., 2015). |
| Dataset Splits | No | The paper mentions using a 'held out development set' for early stopping and tuning parameters, and specific 'Regular Dev' and 'Noisy Dev' sets (2048 utterances each) for certain experiments, but it does not give explicit split percentages or absolute sample counts for the main train/validation/test splits across all datasets. |
| Hardware Specification | Yes | Our software runs on dense compute nodes with 8 NVIDIA Titan X GPUs per node with a theoretical peak throughput of 48 single-precision TFLOP/s. For deployment, the server uses one NVIDIA Quadro K1200 GPU for RNN evaluation. |
| Software Dependencies | No | The paper mentions custom implementations and optimized kernels from Nervana Systems and NVIDIA but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The inputs to the network are a sequence of log-spectrograms of power normalized audio clips, calculated on 20ms windows. The network is trained end-to-end using the CTC loss function. Most of our experiments use bidirectional recurrent layers with clipped rectified-linear units (ReLU) σ(x) = min{max{x, 0}, 20} as the activation function. We use stochastic gradient descent with Nesterov momentum along with a minibatch of 512 utterances. If the norm of the gradient exceeds a threshold of 400, it is rescaled to 400. The learning rate is chosen from [1 × 10−4, 6 × 10−4] and annealed by a constant factor of 1.2 after each epoch. We use a momentum of 0.99 for all models. (Hedged sketches of this featurization and training configuration follow the table.) |
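
The Experiment Setup row states that inputs are log-spectrograms of power-normalized audio computed on 20ms windows, but the quoted text does not specify the hop size, window function, or the exact normalization. The sketch below is a minimal NumPy illustration of that featurization under assumed choices (10ms hop, Hann window, RMS power normalization); it is not the paper's implementation.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=16000, window_ms=20, hop_ms=10, eps=1e-10):
    """Log-magnitude spectrogram on 20 ms windows of power-normalized audio.

    The hop size, Hann window, and RMS normalization are assumptions; the paper
    only specifies the 20 ms window and power normalization.
    """
    audio = np.asarray(audio, dtype=np.float64)
    audio = audio / (np.sqrt(np.mean(audio ** 2)) + eps)       # assumed RMS power normalization
    win = int(sample_rate * window_ms / 1000)                  # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(audio, win)[::hop]
    frames = frames * np.hanning(win)                          # assumed window function
    spectra = np.abs(np.fft.rfft(frames, axis=-1))
    return np.log(spectra + eps)                                # shape: (num_frames, win // 2 + 1)
```

At a 16 kHz sample rate this yields 161 frequency bins per frame (a 320-sample window gives 161 one-sided FFT bins).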
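
The same row quotes the optimization recipe: CTC loss, the clipped ReLU σ(x) = min{max{x, 0}, 20}, Nesterov SGD with momentum 0.99, a 512-utterance minibatch, gradient-norm rescaling at 400, and a learning rate from [1e-4, 6e-4] annealed by a factor of 1.2 per epoch. Below is a minimal PyTorch sketch wiring those hyperparameters together; `TinyCTCModel`, its layer sizes, and the learning rate of 3e-4 are hypothetical stand-ins, and the paper's actual system uses its own HPC-optimized implementation rather than this code.

```python
# A minimal sketch, not the authors' code: only the quoted hyperparameters
# (clipped ReLU, CTC loss, Nesterov momentum 0.99, gradient-norm clipping at
# 400, 1.2x learning-rate annealing) come from the paper.
import torch
import torch.nn as nn

def clipped_relu(x):
    # sigma(x) = min{max{x, 0}, 20}
    return torch.clamp(x, min=0.0, max=20.0)

class TinyCTCModel(nn.Module):
    """Hypothetical stand-in: one bidirectional recurrent layer + character outputs."""
    def __init__(self, n_features=161, n_hidden=256, n_chars=29):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, spectrograms):                 # (batch, time, features)
        h, _ = self.rnn(spectrograms)
        h = clipped_relu(h)
        return self.fc(h).log_softmax(dim=-1)        # log-probs for CTC

model = TinyCTCModel()
ctc_loss = nn.CTCLoss(blank=0)
# Learning rate chosen from [1e-4, 6e-4]; 3e-4 here is an arbitrary pick.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.99, nesterov=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=1 / 1.2)

def train_step(spectrograms, targets, input_lengths, target_lengths):
    optimizer.zero_grad()
    log_probs = model(spectrograms).transpose(0, 1)  # CTCLoss expects (time, batch, chars)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    # "If the norm of the gradient exceeds a threshold of 400, it is rescaled to 400."
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=400.0)
    optimizer.step()
    return loss.item()
```

In a full loop, `scheduler.step()` would be called once per epoch to apply the 1.2x annealing described in the quoted setup.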