Multichannel End-to-end Speech Recognition
Authors: Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer. |
| Researcher Affiliation | Collaboration | (1) Doshisha University, Kyoto, Japan; (2) Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)" which refers to the framework used, not an open-source release of their specific implementation. |
| Open Datasets | Yes | We study the effectiveness of our multichannel end-to-end system compared to a baseline end-to-end system with noisy speech or beamformed inputs. We use the two multichannel speech recognition benchmarks, CHiME-4 (Vincent et al., 2016) and AMI (Hain et al., 2007). |
| Dataset Splits | Yes | CHiME-4 ... The training set consists of 3 hours of real speech data uttered by 4 speakers and 15 hours of simulation speech data uttered by 83 speakers. The development set consists of 2.9 hours of real and simulation speech data uttered by 4 speakers, respectively. ... AMI ... The training set consists of about 78 hours of speech data uttered by 135 speakers. The development and evaluation sets consist of about 9 hours of speech data uttered by 18 and 16 speakers, respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)." While a software name is given, no version number for Chainer is provided, nor are other specific dependencies with versions. |
| Experiment Setup | Yes | We used 40-dimensional log Mel filterbank coefficients as an input feature vector... We used a 4-layer BLSTM with 320 cells in the encoder... 1-layer LSTM with 320 cells in the decoder... we subsampled the hidden states of the first and second layers... After every BLSTM layer, we used a linear projection layer with 320 units... For the attention mechanism, 10 centered convolution filters (DF = 10) of width 100 (Df = 100) were used... attention inner product dimension as 320 (DW = 320), and used the sharpening factor α = 2. To boost the optimization... we adopted a joint CTC-attention multi-task loss function... set the CTC loss weight as 0.1. ... 256 STFT coefficients and the offset were computed from a 25 ms-width Hamming window with 10 ms shift (F = 257). Both the filter and mask estimation network approaches used a similar 3-layer BLSTM with 320 cells (DZ = 320)... For the reference selection attention mechanism, we used the same attention inner product dimension (DV = 320) and sharpening factor β = 2... We used the AdaDelta algorithm... with gradient clipping... initialized the AdaDelta hyperparameters ρ = 0.95 and ϵ = 1e-8. ... decreased the AdaDelta hyperparameter ϵ by multiplying it by 0.01 at each subsequent epoch. The training procedure was stopped after 15 epochs. |
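
To make the quoted setup concrete, the sketch below wires together the two training-specific details the paper fully specifies: the joint CTC-attention multi-task loss with CTC weight 0.1, and AdaDelta (ρ = 0.95, ϵ = 1e-8) with the ϵ multiplied by 0.01 per epoch and training stopped after 15 epochs. The authors' implementation is in Chainer and was not released, so this is only an illustrative sketch in PyTorch: the stand-in linear layer, random batch data, gradient-clipping threshold, and the unconditional per-epoch ϵ decay (the quote elides the condition that starts the decay) are all assumptions, not details from the paper.

```python
# Minimal sketch of the quoted training objective and optimizer schedule.
# PyTorch is used for illustration only; the paper's code is in Chainer.
import torch
import torch.nn as nn

CTC_WEIGHT = 0.1        # joint CTC-attention loss weight (from the paper)
RHO, EPS = 0.95, 1e-8   # AdaDelta hyperparameters (from the paper)
NUM_EPOCHS = 15         # training stopped after 15 epochs (from the paper)
GRAD_CLIP = 5.0         # clipping threshold: assumed, not stated in the paper


class JointCTCAttentionLoss(nn.Module):
    """L = lambda * L_ctc + (1 - lambda) * L_att, with lambda = 0.1."""

    def __init__(self, blank_id: int, ctc_weight: float = CTC_WEIGHT):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=-1)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, vocab) log-softmax over encoder frames
        # att_logits:    (B * L, vocab) flattened decoder outputs
        # targets:       (B, L) label ids; -1 marks padding for the CE term
        ctc_loss = self.ctc(ctc_log_probs, targets.clamp(min=0),
                            input_lengths, target_lengths)
        att_loss = self.att(att_logits, targets.reshape(-1))
        return self.ctc_weight * ctc_loss + (1 - self.ctc_weight) * att_loss


# Toy stand-in for the real encoder-decoder stack, so the snippet runs
# end to end on random data. T = frames, B = batch, L = label length.
torch.manual_seed(0)
T, B, L, V = 50, 2, 10, 30
proj = nn.Linear(320, V)
criterion = JointCTCAttentionLoss(blank_id=0)
optimizer = torch.optim.Adadelta(proj.parameters(), rho=RHO, eps=EPS)

for epoch in range(NUM_EPOCHS):
    enc = proj(torch.randn(T, B, 320))          # fake encoder outputs
    ctc_log_probs = enc.log_softmax(dim=-1)
    att_logits = proj(torch.randn(B * L, 320))  # fake decoder outputs
    targets = torch.randint(1, V, (B, L))       # labels; 0 is the blank
    loss = criterion(ctc_log_probs, att_logits, targets,
                     input_lengths=torch.full((B,), T, dtype=torch.long),
                     target_lengths=torch.full((B,), L, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(proj.parameters(), GRAD_CLIP)
    optimizer.step()
    # The paper multiplies epsilon by 0.01 at each subsequent epoch; the
    # trigger for the decay is elided in the quote, so it is applied
    # unconditionally here.
    for group in optimizer.param_groups:
        group["eps"] *= 0.01
```

Since AdaDelta has no explicit learning rate, shrinking ϵ reduces the effective step size, so this schedule plausibly plays the role of learning-rate annealing.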