Multichannel End-to-end Speech Recognition
Authors: Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer. |
| Researcher Affiliation | Collaboration | (1) Doshisha University, Kyoto, Japan; (2) Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)" which refers to the framework used, not an open-source release of their specific implementation. |
| Open Datasets | Yes | We study the effectiveness of our multichannel end-to-end system compared to a baseline end-to-end system with noisy speech or beamformed inputs. We use the two multichannel speech recognition benchmarks, CHiME-4 (Vincent et al., 2016) and AMI (Hain et al., 2007). |
| Dataset Splits | Yes | CHiME-4 ... The training set consists of 3 hours of real speech data uttered by 4 speakers and 15 hours of simulation speech data uttered by 83 speakers. The development set consists of 2.9 hours of real and simulation speech data uttered by 4 speakers, respectively. ... AMI ... The training set consists of about 78 hours of speech data uttered by 135 speakers. The development and evaluation sets consist of about 9 hours of speech data uttered by 18 and 16 speakers, respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)." While a software name is given, no version number for Chainer is provided, nor are other specific dependencies with versions. |
| Experiment Setup | Yes | We used 40-dimensional log Mel filterbank coefficients as an input feature vector... We used a 4-layer BLSTM with 320 cells in the encoder... 1-layer LSTM with 320 cells in the decoder... we subsampled the hidden states of the first and second layers... After every BLSTM layer, we used a linear projection layer with 320 units... For the attention mechanism, 10 centered convolution filters (DF = 10) of width 100 (Df = 100) were used... attention inner product dimension as 320 (DW = 320), and used the sharpening factor α = 2. To boost the optimization... we adopted a joint CTC-attention multi-task loss function... set the CTC loss weight as 0.1. ... 256 STFT coefficients and the offset were computed from a 25 ms-width Hamming window with 10 ms shift (F = 257). Both the filter and mask estimation network approaches used a similar 3-layer BLSTM with 320 cells (DZ = 320)... For the reference selection attention mechanism, we used the same attention inner product dimension (DV = 320) and sharpening factor β = 2... We used the AdaDelta algorithm... with gradient clipping... initialized the AdaDelta hyperparameters ρ = 0.95 and ϵ = 1e-8. ... decreased the AdaDelta hyperparameter ϵ by multiplying it by 0.01 at each subsequent epoch. The training procedure was stopped after 15 epochs. |
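
To make the quoted setup concrete, the sketch below wires together the two training-specific details the paper fully specifies: the joint CTC-attention multi-task loss with CTC weight 0.1, and AdaDelta (ρ = 0.95, ϵ = 1e-8) with the ϵ multiplied by 0.01 per epoch and training stopped after 15 epochs. The authors' implementation is in Chainer and was not released, so this is only an illustrative sketch in PyTorch: the stand-in linear layer, random batch data, gradient-clipping threshold, and the unconditional per-epoch ϵ decay (the quote elides the condition that starts the decay) are all assumptions, not details from the paper.

```python
# Minimal sketch of the quoted training objective and optimizer schedule.
# PyTorch is used for illustration only; the paper's code is in Chainer.
import torch
import torch.nn as nn

CTC_WEIGHT = 0.1        # joint CTC-attention loss weight (from the paper)
RHO, EPS = 0.95, 1e-8   # AdaDelta hyperparameters (from the paper)
NUM_EPOCHS = 15         # training stopped after 15 epochs (from the paper)
GRAD_CLIP = 5.0         # clipping threshold: assumed, not stated in the paper


class JointCTCAttentionLoss(nn.Module):
    """L = lambda * L_ctc + (1 - lambda) * L_att, with lambda = 0.1."""

    def __init__(self, blank_id: int, ctc_weight: float = CTC_WEIGHT):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att = nn.CrossEntropyLoss(ignore_index=-1)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, vocab) log-softmax over encoder frames
        # att_logits:    (B * L, vocab) flattened decoder outputs
        # targets:       (B, L) label ids; -1 marks padding for the CE term
        ctc_loss = self.ctc(ctc_log_probs, targets.clamp(min=0),
                            input_lengths, target_lengths)
        att_loss = self.att(att_logits, targets.reshape(-1))
        return self.ctc_weight * ctc_loss + (1 - self.ctc_weight) * att_loss


# Toy stand-in for the real encoder-decoder stack, so the snippet runs
# end to end on random data. T = frames, B = batch, L = label length.
torch.manual_seed(0)
T, B, L, V = 50, 2, 10, 30
proj = nn.Linear(320, V)
criterion = JointCTCAttentionLoss(blank_id=0)
optimizer = torch.optim.Adadelta(proj.parameters(), rho=RHO, eps=EPS)

for epoch in range(NUM_EPOCHS):
    enc = proj(torch.randn(T, B, 320))          # fake encoder outputs
    ctc_log_probs = enc.log_softmax(dim=-1)
    att_logits = proj(torch.randn(B * L, 320))  # fake decoder outputs
    targets = torch.randint(1, V, (B, L))       # labels; 0 is the blank
    loss = criterion(ctc_log_probs, att_logits, targets,
                     input_lengths=torch.full((B,), T, dtype=torch.long),
                     target_lengths=torch.full((B,), L, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(proj.parameters(), GRAD_CLIP)
    optimizer.step()
    # The paper multiplies epsilon by 0.01 at each subsequent epoch; the
    # trigger for the decay is elided in the quote, so it is applied
    # unconditionally here.
    for group in optimizer.param_groups:
        group["eps"] *= 0.01
```

Since AdaDelta has no explicit learning rate, shrinking ϵ reduces the effective step size, so this schedule plausibly plays the role of learning-rate annealing.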