Multichannel End-to-end Speech Recognition

Authors: Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.
Researcher Affiliation | Collaboration | (1) Doshisha University, Kyoto, Japan; (2) Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)", which refers to the framework used, not an open-source release of their specific implementation.
Open Datasets | Yes | We study the effectiveness of our multichannel end-to-end system compared to a baseline end-to-end system with noisy speech or beamformed inputs. We use the two multichannel speech recognition benchmarks, CHiME-4 (Vincent et al., 2016) and AMI (Hain et al., 2007).
Dataset Splits | Yes | CHiME-4 ... The training set consists of 3 hours of real speech data uttered by 4 speakers and 15 hours of simulation speech data uttered by 83 speakers. The development set consists of 2.9 hours of real and simulation speech data uttered by 4 speakers, respectively. ... AMI ... The training set consists of about 78 hours of speech data uttered by 135 speakers. The development and evaluation sets consist of about 9 hours of speech data uttered by 18 and 16 speakers, respectively.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)." While a software name is given, no version number for Chainer is provided, nor are other specific dependencies with versions.
Experiment Setup | Yes | We used 40-dimensional log Mel filterbank coefficients as an input feature vector... We used 4-layer BLSTM with 320 cells in the encoder... 1-layer LSTM with 320 cells in the decoder... we subsampled the hidden states of the first and second layers... After every BLSTM layer, we used a linear projection layer with 320 units... For the attention mechanism, 10 centered convolution filters (DF = 10) of width 100 (Df = 100) were used... attention inner product dimension as 320 (DW = 320), and used the sharpening factor α = 2. To boost the optimization... we adopted a joint CTC-attention multi-task loss function... set the CTC loss weight as 0.1. ... 256 STFT coefficients and the offset were computed from a 25ms-width Hamming window with 10ms shift (F = 257). Both filter and mask estimation network approaches used a similar 3-layer BLSTM with 320 cells (DZ = 320)... For the reference selection attention mechanism, we used the same attention inner product dimension (DV = 320) and sharpening factor β = 2... We used the AdaDelta algorithm... with gradient clipping... initialized the AdaDelta hyperparameters ρ = 0.95 and ϵ = 1e-8. ... decreased the AdaDelta hyperparameter ϵ by multiplying it by 0.01 at each subsequent epoch. The training procedure was stopped after 15 epochs.
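
For orientation, the quoted experiment setup can be collected into a single configuration. The sketch below is a minimal illustration, not the authors' released implementation: it assumes Chainer (the framework named in the paper), stubs out the model with a placeholder class, and invents a gradient-clipping threshold, since none is reported; the numeric values are taken from the quoted setup above.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# The PlaceholderModel and the clipping threshold of 5.0 are assumptions.
import chainer
import chainer.links as L

config = {
    # Front-end features
    "fbank_dim": 40,            # 40-dim log Mel filterbank input features
    "stft_bins": 257,           # 256 STFT coefficients + offset (F = 257)
    "window": "hamming", "window_ms": 25, "shift_ms": 10,
    # Encoder / decoder
    "enc_layers": 4, "enc_cells": 320, "enc_proj": 320,    # BLSTM encoder + projection
    "dec_layers": 1, "dec_cells": 320,                     # LSTM decoder
    # Attention mechanism
    "conv_filters": 10, "conv_width": 100,                 # DF = 10, Df = 100
    "att_dim": 320, "sharpening_alpha": 2,                 # DW = 320, alpha = 2
    # Neural beamformer (filter/mask estimation networks)
    "bf_layers": 3, "bf_cells": 320,                       # DZ = 320
    "ref_att_dim": 320, "sharpening_beta": 2,              # DV = 320, beta = 2
    # Objective and optimization
    "ctc_weight": 0.1,
    "rho": 0.95, "eps": 1e-8, "epochs": 15,
}

class PlaceholderModel(chainer.Chain):
    """Stand-in for the multichannel encoder-decoder (not the paper's model)."""
    def __init__(self, cfg):
        super().__init__()
        with self.init_scope():
            self.proj = L.Linear(cfg["fbank_dim"], cfg["enc_proj"])

    def forward(self, x):
        return self.proj(x)

model = PlaceholderModel(config)
optimizer = chainer.optimizers.AdaDelta(rho=config["rho"], eps=config["eps"])
optimizer.setup(model)
# Gradient clipping is mentioned in the paper but no threshold is given; 5.0 is assumed.
optimizer.add_hook(chainer.optimizer_hooks.GradientClipping(threshold=5.0))
# The paper further multiplies eps by 0.01 at each subsequent epoch and stops
# training after 15 epochs; that schedule is omitted from this sketch.
```

The joint CTC-attention loss (weight 0.1), the encoder subsampling, and the neural beamformer front-end would all live inside the real model definition, which is only stubbed out here.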