Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
Authors: Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong
AAAI 2020, pp. 8417-8424
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on TED speech translation corpora have shown that our proposed model can outperform strong baselines on the quality of speech translation and achieve better speech recognition performances as well. |
| Researcher Affiliation | Collaboration | Yuchen Liu,1,2 Jiajun Zhang,1,2 Hao Xiong,4 Long Zhou,1,2 Zhongjun He,4 Hua Wu,4 Haifeng Wang,4 Chengqing Zong1,2,3 1National Laboratory of Pattern Recognition, Institute of Automation, CAS 2University of Chinese Academy of Sciences 3CAS Center for Excellence in Brain Science and Intelligence Technology 4Baidu Inc., No. 10, Shangdi 10th Street, Beijing, China |
| Pseudocode | No | The paper describes the model architecture and approach in detail but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology described. |
| Open Datasets | Yes | Finally, we obtain 235K/299K/299K/273K triplet data for En-De/Fr/Zh/Ja language pairs respectively, which contain speech utterances, manual transcriptions and translations. Development and test sets are split according to the partition in IWSLT. We use tst2014 as development (Dev) set and tst2015 as test set. The remaining data are used as training set. This dataset is available on http://www.nlpr.ia.ac.cn/cip/dataset.htm. |
| Dataset Splits | Yes | Development and test sets are split according to the partition in IWSLT. We use tst2014 as development (Dev) set and tst2015 as test set. The remaining data are used as training set. |
| Hardware Specification | Yes | We train our models with Adam optimizer (Kingma and Ba 2015) on 2 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software such as "Moses", "Jieba", "Mecab", the "BPE method", and the "Adam optimizer", but it does not specify version numbers for any of these dependencies, which would be needed for reproducibility. |
| Experiment Setup | Yes | The speech features have 80-dimension log-Mel filterbanks extracted with a step size of 10ms and window size of 25ms, which are extended with mean subtraction and variance normalization. The features are stacked with 3 frames to the left and downsampled to a 30ms frame rate. ... We use the configuration transformer base used by Vaswani et al. (2017) which contains 6-layer encoders and 6-layer decoders with 512-dimensional hidden sizes. We train our models with Adam optimizer (Kingma and Ba 2015) on 2 NVIDIA V100 GPUs. For inference, we perform beam search with a beam size of 4. We set λ = 0.3 and k = 3 in the interactive learning model. |
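
The Experiment Setup row above quotes a concrete acoustic front-end: 80-dimension log-Mel filterbanks (25 ms window, 10 ms step), utterance-level mean and variance normalization, stacking of 3 frames to the left, and downsampling to a 30 ms frame rate. The following is a minimal sketch of that pipeline, assuming a 16 kHz mono input and torchaudio; values derived from the sampling rate (`n_fft`, `hop_length`) and the exact stacking order are assumptions, since the paper's code is not released.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed; the paper does not state the sampling rate

# 80-dim Mel filterbanks with a 25 ms window and 10 ms step (at 16 kHz,
# 400 and 160 samples respectively).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,
    win_length=400,   # 25 ms window
    hop_length=160,   # 10 ms step
    n_mels=80,        # "80-dimension log-Mel filterbanks"
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """Log-Mel features with mean/variance normalization, left-stacking
    of 3 frames, and downsampling from a 10 ms to a 30 ms frame rate."""
    feats = melspec(waveform).squeeze(0).t()   # (frames, 80)
    feats = torch.log(feats + 1e-6)            # log compression
    # Utterance-level mean subtraction and variance normalization.
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)

    # Stack each frame with the 3 frames to its left (4 x 80 = 320 dims),
    # padding the start by repeating the first frame (an assumption).
    pad = feats[:1].repeat(3, 1)
    padded = torch.cat([pad, feats], dim=0)
    stacked = torch.cat(
        [padded[i : i + feats.size(0)] for i in range(4)], dim=1
    )  # row j = [f_{j-3}, f_{j-2}, f_{j-1}, f_j]

    # Keep every 3rd frame to move from a 10 ms to a 30 ms frame rate.
    return stacked[::3]
```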
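The same row also reports the model and optimizer settings. Below is a hedged sketch of that configuration: only the layer counts (6+6) and 512-dimensional hidden size are quoted from the paper; the head count, feed-forward size, dropout, and Adam betas/eps are the usual transformer-base defaults from Vaswani et al. (2017) and are assumptions here. `nn.Transformer` stands in for the paper's dual-decoder model, which is not publicly available.

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # "512-dimensional hidden sizes"
    nhead=8,                # assumed: standard base value
    num_encoder_layers=6,   # "6-layer encoders"
    num_decoder_layers=6,   # "6-layer decoders"
    dim_feedforward=2048,   # assumed: standard base value
    dropout=0.1,            # assumed: standard base value
)
optimizer = torch.optim.Adam(
    model.parameters(),
    betas=(0.9, 0.98), eps=1e-9,  # assumed: standard transformer settings
)

# Inference and interaction hyperparameters as reported in the paper.
BEAM_SIZE = 4   # "beam search with a beam size of 4"
LAMBDA = 0.3    # λ in the interactive learning model
K = 3           # k in the interactive learning model
```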