Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation
Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the CHiME-4 dataset show that the proposed method outperforms conventional techniques in which the front-end is trained only with the ASR objective. In order to evaluate the proposed method in a noisy speech scenario, we conducted a set of experiments using the CHiME-4 dataset. |
| Researcher Affiliation | Academia | Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim and Nam Soo Kim, Department of Electrical and Computer Engineering and INMC, Seoul National University, South Korea. {hjkim, hslee, whkang, hykim}@hi.snu.ac.kr, nkim@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1: A joint training step with density estimation (a hedged sketch of such a step appears after this table). |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In order to evaluate the proposed method in a noisy speech scenario, we conducted a set of experiments using the CHiME-4 dataset. We also employed Wall Street Journal (WSJ) read speech as the single-channel clean speech dataset. |
| Dataset Splits | Yes | The dataset is divided into training, development and evaluation sets. The training set consists of 3 hours of real noisy utterances from 4 speakers and 15 hours of simulated noisy utterances from 83 speakers. The development set consists of 2.9 hours each of real and simulated noisy utterances from 4 speakers. The evaluation set consists of 2.2 hours each of real and simulated noisy utterances. |
| Hardware Specification | No | The paper mentions 'GPU memory' in the context of ASR module size but does not specify any particular GPU models, CPU types, or other hardware used for experiments. |
| Software Dependencies | No | The paper mentions software like ESPnet but does not provide specific version numbers for any key software components or libraries required for replication. |
| Experiment Setup | Yes | Neural Beamformer: To compute 200 STFT coefficients (i.e., F = 201), a 25 ms Hanning window with a 10 ms shift was used. We used a 3-layer bidirectional LSTM with 300 cells for the BiLSTM in Eq. (3), and a linear projection layer with 300 units was inserted after every bidirectional LSTM layer. For the FC layer in Eq. (4), a 1-layer linear transformation was used. ... MelFlow: We used a MelFlow consisting of 8 affine coupling layers. For each WaveNet2D, the kernel sizes of the first and last convolutional layers were set to 1; the remaining middle 4 layers had 20 channels and kernel size 3, and were used for residual connections, skip connections and gated-tanh units. ... ASR loss: ... we set λ to 0.5 for the experiments. Total loss: The total loss is defined as L_tot = L_ASR + β·L_gen (Eq. 21), where β is a hyperparameter; we experimented with different values of β. ... Encoder: a 4-layer 2D convolutional network and a 3-layer bidirectional LSTM with 1024 cells were used. The kernel sizes were (3, 3) for all convolutional layers, with channels (1, 64), (64, 64), (64, 128) and (128, 128), respectively. (A hedged sketch of one affine coupling layer follows the table.) |
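To make Algorithm 1 concrete, here is a minimal PyTorch sketch of a joint training step that combines the ASR loss with the flow-based density-estimation loss, L_tot = L_ASR + β·L_gen (Eq. 21). All module and argument names (`beamformer`, `featurizer`, `encoder`, `mel_flow`, `asr_head`, `beta`) are illustrative assumptions, not the authors' code; `asr_head` is assumed to return the hybrid ASR loss that the paper interpolates with λ = 0.5.

```python
import torch

def joint_training_step(batch, beamformer, featurizer, encoder, mel_flow,
                        asr_head, optimizer, beta=0.5):
    """One optimization step: L_tot = L_ASR + beta * L_gen (Eq. 21)."""
    noisy_stft, targets = batch               # multi-channel STFT + transcripts

    # Front-end: mask-based neural beamforming, then log-mel features
    # (featurizer is a hypothetical stand-in for the feature extraction).
    enhanced = beamformer(noisy_stft)
    feats = featurizer(enhanced)              # (batch, mel_channels, time)

    # ASR branch: encoder followed by the ASR loss head.
    loss_asr = asr_head(encoder(feats), targets)

    # Generative branch: negative log-likelihood of the enhanced features
    # under the normalizing flow (standard Gaussian prior on z).
    z, log_det = mel_flow(feats)
    loss_gen = 0.5 * (z ** 2).sum(dim=(1, 2)).mean() - log_det.mean()

    loss_tot = loss_asr + beta * loss_gen     # Eq. (21)
    optimizer.zero_grad()
    loss_tot.backward()
    optimizer.step()
    return loss_tot.item()
```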
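Likewise, a minimal sketch of one affine coupling layer of the kind MelFlow stacks eight of. The conditioning network below is a simplified 1D stand-in for the paper's WaveNet2D (1×1 convolutions at the ends, middle layers with 20 channels and kernel size 3, gated-tanh units); the class name, `mel_channels=80`, and the plain `Tanh` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: half of the mel channels condition an
    invertible affine transform of the other half."""

    def __init__(self, mel_channels=80, hidden=20):
        super().__init__()
        half = mel_channels // 2
        # Simplified stand-in for WaveNet2D: 1x1 convs at the ends,
        # a middle layer with `hidden` channels and kernel size 3.
        self.net = nn.Sequential(
            nn.Conv1d(half, hidden, kernel_size=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Tanh(),                        # stand-in for the gated-tanh unit
            nn.Conv1d(hidden, half * 2, kernel_size=1),
        )

    def forward(self, x):                     # x: (batch, mel_channels, time)
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.net(x_a).chunk(2, dim=1)
        y_b = x_b * torch.exp(log_s) + t      # affine transform of x_b
        log_det = log_s.sum(dim=(1, 2))       # per-utterance log-determinant
        return torch.cat([x_a, y_b], dim=1), log_det
```

Stacking 8 such layers (with a channel permutation between layers so every dimension eventually gets transformed) and summing the per-layer log-determinants yields the flow's log-likelihood under a standard Gaussian prior, i.e., the L_gen term used in the training-step sketch above.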