Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation
Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the CHiME-4 dataset show that the proposed method outperforms conventional techniques in which the front-end is trained only with the ASR objective. In order to evaluate the proposed method in a noisy speech scenario, we conducted a set of experiments using the CHiME-4 dataset. |
| Researcher Affiliation | Academia | Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim and Nam Soo Kim, Department of Electrical and Computer Engineering and INMC, Seoul National University, South Korea. {hjkim, hslee, whkang, hykim}@hi.snu.ac.kr, nkim@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1: A joint training step with density estimation (a hedged sketch of such a step appears after this table). |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In order to evaluate the proposed method in a noisy speech scenario, we conducted a set of experiments using the CHiME-4 dataset. We also employed Wall Street Journal (WSJ) read speech as the single-channel clean speech dataset. |
| Dataset Splits | Yes | The dataset is divided into training, development and evaluation sets. The training set consists of 3 hours of real noisy utterances from 4 speakers and 15 hours of simulated noisy utterances from 83 speakers. The development set consists of 2.9 hours each of real and simulated noisy utterances from 4 speakers. The evaluation set consists of 2.2 hours each of real and simulated noisy utterances. |
| Hardware Specification | No | The paper mentions 'GPU memory' in the context of ASR module size but does not specify any particular GPU models, CPU types, or other hardware used for experiments. |
| Software Dependencies | No | The paper mentions software like ESPnet but does not provide specific version numbers for any key software components or libraries required for replication. |
| Experiment Setup | Yes | Neural Beamformer: To compute 200 STFT coefficients (i.e., F = 201), a 25 ms Hanning window with a 10 ms shift was used. We used a 3-layer bidirectional LSTM with 300 cells for the BiLSTM in Eq. (3), and a linear projection layer with 300 units was inserted after every bidirectional LSTM layer. For the FC layer in Eq. (4), a 1-layer linear transformation was used. ... MelFlow: We used a MelFlow consisting of 8 affine coupling layers. For each WaveNet2D, the kernel sizes of the first and last convolutional layers were set to 1; the remaining middle 4 layers had 20 channels and kernel size 3, and were used for residual connections, skip connections and gated-tanh units. ... ASR loss: ... we set λ to 0.5 for the experiments. Total loss: The total loss is defined as L_tot = L_ASR + β·L_gen (Eq. 21), where β is a hyperparameter; we experimented with different values of β. ... Encoder: a 4-layer 2D convolutional network and a 3-layer bidirectional LSTM with 1024 cells were used. The kernel sizes were (3, 3) for all convolutional layers, with channels (1, 64), (64, 64), (64, 128) and (128, 128), respectively. (A hedged sketch of one affine coupling layer follows the table.) |
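To make Algorithm 1 concrete, here is a minimal PyTorch sketch of a joint training step that combines the ASR loss with the flow-based density-estimation loss, L_tot = L_ASR + β·L_gen (Eq. 21). All module and argument names (`beamformer`, `featurizer`, `encoder`, `mel_flow`, `asr_head`, `beta`) are illustrative assumptions, not the authors' code; `asr_head` is assumed to return the hybrid ASR loss that the paper interpolates with λ = 0.5.

```python
import torch

def joint_training_step(batch, beamformer, featurizer, encoder, mel_flow,
                        asr_head, optimizer, beta=0.5):
    """One optimization step: L_tot = L_ASR + beta * L_gen (Eq. 21)."""
    noisy_stft, targets = batch               # multi-channel STFT + transcripts

    # Front-end: mask-based neural beamforming, then log-mel features
    # (featurizer is a hypothetical stand-in for the feature extraction).
    enhanced = beamformer(noisy_stft)
    feats = featurizer(enhanced)              # (batch, mel_channels, time)

    # ASR branch: encoder followed by the ASR loss head.
    loss_asr = asr_head(encoder(feats), targets)

    # Generative branch: negative log-likelihood of the enhanced features
    # under the normalizing flow (standard Gaussian prior on z).
    z, log_det = mel_flow(feats)
    loss_gen = 0.5 * (z ** 2).sum(dim=(1, 2)).mean() - log_det.mean()

    loss_tot = loss_asr + beta * loss_gen     # Eq. (21)
    optimizer.zero_grad()
    loss_tot.backward()
    optimizer.step()
    return loss_tot.item()
```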
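Likewise, a minimal sketch of one affine coupling layer of the kind MelFlow stacks eight of. The conditioning network below is a simplified 1D stand-in for the paper's WaveNet2D (1×1 convolutions at the ends, middle layers with 20 channels and kernel size 3, gated-tanh units); the class name, `mel_channels=80`, and the plain `Tanh` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: half of the mel channels condition an
    invertible affine transform of the other half."""

    def __init__(self, mel_channels=80, hidden=20):
        super().__init__()
        half = mel_channels // 2
        # Simplified stand-in for WaveNet2D: 1x1 convs at the ends,
        # a middle layer with `hidden` channels and kernel size 3.
        self.net = nn.Sequential(
            nn.Conv1d(half, hidden, kernel_size=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.Tanh(),                        # stand-in for the gated-tanh unit
            nn.Conv1d(hidden, half * 2, kernel_size=1),
        )

    def forward(self, x):                     # x: (batch, mel_channels, time)
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.net(x_a).chunk(2, dim=1)
        y_b = x_b * torch.exp(log_s) + t      # affine transform of x_b
        log_det = log_s.sum(dim=(1, 2))       # per-utterance log-determinant
        return torch.cat([x_a, y_b], dim=1), log_det
```

Stacking 8 such layers (with a channel permutation between layers so every dimension eventually gets transformed) and summing the per-layer log-determinants yields the flow's log-likelihood under a standard Gaussian prior, i.e., the L_gen term used in the training-step sketch above.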