Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Multichannel End-to-end Speech Recognition
Authors: Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer. |
| Researcher Affiliation | Collaboration | 1Doshisha University, Kyoto, Japan 2Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)" which refers to the framework used, not an open-source release of their specific implementation. |
| Open Datasets | Yes | We study the effectiveness of our multichannel end-to-end system compared to a baseline end-to-end system with noisy speech or beamformed inputs. We use the two multichannel speech recognition benchmarks, CHiME-4 (Vincent et al., 2016) and AMI (Hain et al., 2007). |
| Dataset Splits | Yes | CHiME-4 ... The training set consists of 3 hours of real speech data uttered by 4 speakers and 15 hours of simulation speech data uttered by 83 speakers. The development set consists of 2.9 hours of real and simulation speech data uttered by 4 speakers, respectively. ... AMI ... The training set consists of about 78 hours of speech data uttered by 135 speakers. The development and evaluation sets consist of about 9 hours of speech data uttered by 18 and 16 speakers, respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions: "All the above networks are implemented by using Chainer (Tokui et al., 2015)." While a software name is given, no version number for Chainer is provided, nor are other specific dependencies with versions. |
| Experiment Setup | Yes | We used 40-dimensional log Mel filterbank coefficients as an input feature vector... We used 4-layer BLSTM with 320 cells in the encoder... 1-layer LSTM with 320 cells in the decoder... we subsampled the hidden states of the first and second layers... After every BLSTM layer, we used a linear projection layer with 320 units... For the attention mechanism, 10 centered convolution filters (D_F = 10) of width 100 (D_f = 100) were used... attention inner product dimension as 320 (D_W = 320), and used the sharpening factor α = 2. To boost the optimization... we adopted a joint CTC-attention multi-task loss function... set the CTC loss weight as 0.1. ... 256 STFT coefficients and the offset were computed from 25ms-width Hamming window with 10ms shift (F = 257). Both filter and mask estimation network approaches used a similar 3-layer BLSTM with 320 cells (D_Z = 320)... For the reference selection attention mechanism, we used the same attention inner product dimension (D_V = 320) and sharpening factor β = 2... We used the AdaDelta algorithm... with gradient clipping... initialized the AdaDelta hyperparameters ρ = 0.95 and ϵ = 1e-8. ... decreased the AdaDelta hyperparameter ϵ by multiplying it by 0.01 at each subsequent epoch. The training procedure was stopped after 15 epochs. |
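
For readers attempting a reproduction, the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code (the paper releases none). The class name `ExperimentSetup`, its field names, and the three helper functions are hypothetical. Two values are assumptions that the table does not state: a 16 kHz sampling rate and a 512-point FFT, chosen because they are consistent with the quoted F = 257. The linear interpolation in `joint_loss` is likewise an assumed form of the joint CTC-attention loss; the table only gives the CTC weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical field names)."""
    n_mel: int = 40                    # log Mel filterbank dimension
    encoder_layers: int = 4            # BLSTM layers in the encoder
    encoder_cells: int = 320
    decoder_layers: int = 1            # LSTM layers in the decoder
    decoder_cells: int = 320
    projection_units: int = 320        # linear projection after each BLSTM layer
    attention_conv_filters: int = 10   # centered convolution filters
    attention_conv_width: int = 100
    attention_dim: int = 320           # attention inner product dimension
    attention_sharpening: float = 2.0  # alpha
    ctc_weight: float = 0.1            # CTC loss weight
    window_ms: int = 25                # Hamming window width
    shift_ms: int = 10                 # frame shift
    stft_bins: int = 257               # F: 256 STFT coefficients plus the offset
    adadelta_rho: float = 0.95
    adadelta_eps: float = 1e-8
    eps_decay: float = 0.01            # factor applied to eps when it is decreased
    max_epochs: int = 15
    # ASSUMPTIONS: neither value appears in the table.
    sample_rate_hz: int = 16000
    n_fft: int = 512


def frame_sizes(cfg: ExperimentSetup) -> tuple[int, int, int]:
    """Return (window samples, hop samples, one-sided STFT bins)."""
    win = cfg.sample_rate_hz * cfg.window_ms // 1000
    hop = cfg.sample_rate_hz * cfg.shift_ms // 1000
    bins = cfg.n_fft // 2 + 1  # one-sided spectrum of a real signal
    return win, hop, bins


def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """ASSUMED form: linear interpolation of the CTC and attention losses."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def decayed_eps(cfg: ExperimentSetup, n_decays: int) -> float:
    """AdaDelta eps after the 0.01 factor has been applied n_decays times."""
    return cfg.adadelta_eps * cfg.eps_decay ** n_decays


if __name__ == "__main__":
    cfg = ExperimentSetup()
    print(frame_sizes(cfg))
    print(joint_loss(10.0, 20.0, cfg.ctc_weight))
    print(decayed_eps(cfg, 2))
```

Under the assumed sampling rate and FFT size, `frame_sizes` gives a 400-sample window, a 160-sample hop, and 257 bins, which matches the quoted F = 257. The condition that triggers each decrease of ϵ is elided in the row above, so `decayed_eps` takes the number of decreases as an argument rather than guessing at a schedule.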