Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
Authors: Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong
AAAI 2020, pp. 8417-8424
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on TED speech translation corpora have shown that our proposed model can outperform strong baselines on the quality of speech translation and achieve better speech recognition performances as well. |
| Researcher Affiliation | Collaboration | Yuchen Liu,1,2 Jiajun Zhang,1,2 Hao Xiong,4 Long Zhou,1,2 Zhongjun He,4 Hua Wu,4 Haifeng Wang,4 Chengqing Zong1,2,3 1National Laboratory of Pattern Recognition, Institute of Automation, CAS 2University of Chinese Academy of Sciences 3CAS Center for Excellence in Brain Science and Intelligence Technology 4Baidu Inc., No. 10, Shangdi 10th Street, Beijing, China |
| Pseudocode | No | The paper describes the model architecture and approach in detail but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology described. |
| Open Datasets | Yes | Finally, we obtain 235K/299K/299K/273K triplet data for En-De/Fr/Zh/Ja language pairs respectively, which contain speech utterances, manual transcriptions and translations. Development and test sets are split according to the partition in IWSLT. We use tst2014 as development (Dev) set and tst2015 as test set. The remaining data are used as training set. This dataset is available on http://www.nlpr.ia.ac.cn/cip/dataset.htm. |
| Dataset Splits | Yes | Development and test sets are split according to the partition in IWSLT. We use tst2014 as development (Dev) set and tst2015 as test set. The remaining data are used as training set. |
| Hardware Specification | Yes | We train our models with Adam optimizer (Kingma and Ba 2015) on 2 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as "Moses", "Jieba", "Mecab", the "BPE method", and the "Adam optimizer", but it does not specify version numbers for any of these dependencies, which would be needed for reproducibility. |
| Experiment Setup | Yes | The speech features have 80-dimension log-Mel filterbanks extracted with a step size of 10ms and window size of 25ms, which are extended with mean subtraction and variance normalization. The features are stacked with 3 frames to the left and downsampled to a 30ms frame rate. ... We use the configuration transformer base used by Vaswani et al. (2017) which contains 6-layer encoders and 6-layer decoders with 512-dimensional hidden sizes. We train our models with Adam optimizer (Kingma and Ba 2015) on 2 NVIDIA V100 GPUs. For inference, we perform beam search with a beam size of 4. We set λ = 0.3 and k = 3 in the interactive learning model. |
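
The Experiment Setup row above quotes a concrete acoustic front-end: 80-dimension log-Mel filterbanks (25 ms window, 10 ms step), utterance-level mean and variance normalization, stacking of 3 frames to the left, and downsampling to a 30 ms frame rate. The following is a minimal sketch of that pipeline, assuming a 16 kHz mono input and torchaudio; values derived from the sampling rate (`n_fft`, `hop_length`) and the exact stacking order are assumptions, since the paper's code is not released.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed; the paper does not state the sampling rate

# 80-dim Mel filterbanks with a 25 ms window and 10 ms step (at 16 kHz,
# 400 and 160 samples respectively).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,
    win_length=400,   # 25 ms window
    hop_length=160,   # 10 ms step
    n_mels=80,        # "80-dimension log-Mel filterbanks"
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """Log-Mel features with mean/variance normalization, left-stacking
    of 3 frames, and downsampling from a 10 ms to a 30 ms frame rate."""
    feats = melspec(waveform).squeeze(0).t()   # (frames, 80)
    feats = torch.log(feats + 1e-6)            # log compression
    # Utterance-level mean subtraction and variance normalization.
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)

    # Stack each frame with the 3 frames to its left (4 x 80 = 320 dims),
    # padding the start by repeating the first frame (an assumption).
    pad = feats[:1].repeat(3, 1)
    padded = torch.cat([pad, feats], dim=0)
    stacked = torch.cat(
        [padded[i : i + feats.size(0)] for i in range(4)], dim=1
    )  # row j = [f_{j-3}, f_{j-2}, f_{j-1}, f_j]

    # Keep every 3rd frame to move from a 10 ms to a 30 ms frame rate.
    return stacked[::3]
```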
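The same row also reports the model and optimizer settings. Below is a hedged sketch of that configuration: only the layer counts (6+6) and 512-dimensional hidden size are quoted from the paper; the head count, feed-forward size, dropout, and Adam betas/eps are the usual transformer-base defaults from Vaswani et al. (2017) and are assumptions here. `nn.Transformer` stands in for the paper's dual-decoder model, which is not publicly available.

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # "512-dimensional hidden sizes"
    nhead=8,                # assumed: standard base value
    num_encoder_layers=6,   # "6-layer encoders"
    num_decoder_layers=6,   # "6-layer decoders"
    dim_feedforward=2048,   # assumed: standard base value
    dropout=0.1,            # assumed: standard base value
)
optimizer = torch.optim.Adam(
    model.parameters(),
    betas=(0.9, 0.98), eps=1e-9,  # assumed: standard transformer settings
)

# Inference and interaction hyperparameters as reported in the paper.
BEAM_SIZE = 4   # "beam search with a beam size of 4"
LAMBDA = 0.3    # λ in the interactive learning model
K = 3           # k in the interactive learning model
```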