Consecutive Decoding for Speech-to-text Translation

Authors: Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li (pp. 12738-12748)

AAAI 2021

Reproducibility assessment: each variable below is listed with its result and the LLM response quoted from the paper.
Research Type: Experimental
LLM Response: Our method is verified on three mainstream datasets, including the Augmented LibriSpeech English-French dataset, the TED English-German dataset, and the TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms the previous state-of-the-art methods.
Researcher Affiliation: Collaboration
LLM Response: Qianqian Dong (1,2), Mingxuan Wang (3), Hao Zhou (3), Shuang Xu (1), Bo Xu (1,2), Lei Li (3). 1 Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, 100190, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China; 3 ByteDance AI Lab, China
Pseudocode: Yes
LLM Response: Algorithm 1 COSTT without pre-training
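The algorithm itself is not reproduced in this report. As a rough, hedged illustration of what consecutive decoding implies for the training target, the Python sketch below concatenates the source transcription and the target translation into a single decoder output sequence; the separator token and helper names are assumptions rather than the authors' code, and the only supporting detail is the doubled maximum decoding length quoted under Experiment Setup.

```python
# Hypothetical sketch of a consecutive-decoding target (not the authors' code):
# the decoder is trained to emit the source transcription first, then the
# translation, as one output sequence.

SEP_TOKEN = "<sep>"  # assumed separator between transcription and translation
EOS_TOKEN = "</s>"   # assumed end-of-sequence marker


def build_consecutive_target(transcription_tokens, translation_tokens):
    """Concatenate transcription and translation into one decoder target."""
    return transcription_tokens + [SEP_TOKEN] + translation_tokens + [EOS_TOKEN]


# The combined target is roughly twice as long as a translation-only target,
# which is consistent with the maximum decoding length of 500 vs. 250.
target = build_consecutive_target(
    ["he", "bought", "a", "book"],          # English transcription
    ["il", "a", "acheté", "un", "livre"],   # French translation
)
print(target)
```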
Open Source Code: Yes
LLM Response: The code is available at https://github.com/dqqcasia/st.
Open Datasets: Yes
LLM Response: We conduct experiments on three popular publicly available datasets, including the Augmented LibriSpeech English-French dataset (Kocabiyikoglu, Besacier, and Kraif 2018), the IWSLT2018 English-German dataset (Jan et al. 2018), and the TED English-Chinese dataset (Liu et al. 2019).
Dataset Splits: Yes
LLM Response: Augmented LibriSpeech: We experiment on the 100-hour clean train set for training, with a 2-hour development set and a 4-hour test set, corresponding to 47,271, 1,071, and 2,048 utterances respectively. IWSLT2018 English-German: We use dev2010 as the validation set and tst2013 as the test set, corresponding to 653 and 793 utterances respectively. TED English-Chinese: Finally, we get a 524-hour train set, a 1.5-hour validation set, and a 2.5-hour test set, corresponding to 308,660, 835, and 1,223 utterances respectively.
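For quick reference, the split sizes quoted above can be collected into one structure; the dictionary keys below are illustrative only and do not come from the paper or its repository.

```python
# Split sizes as reported above; key names are illustrative only.
DATASET_SPLITS = {
    "augmented_librispeech_en_fr": {
        "train": {"hours": 100, "utterances": 47_271},
        "dev":   {"hours": 2,   "utterances": 1_071},
        "test":  {"hours": 4,   "utterances": 2_048},
    },
    "iwslt2018_en_de": {
        "dev":  {"split": "dev2010", "utterances": 653},
        "test": {"split": "tst2013", "utterances": 793},
    },
    "ted_en_zh": {
        "train": {"hours": 524, "utterances": 308_660},
        "dev":   {"hours": 1.5, "utterances": 835},
        "test":  {"hours": 2.5, "utterances": 1_223},
    },
}
```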
Hardware Specification: Yes
LLM Response: We train our models on 1 NVIDIA V100 GPU with a maximum number of 400k training steps.
Software Dependencies: No
LLM Response: For target French and German text data, we lowercase all the texts, tokenize, and normalize punctuation with the Moses scripts. For the English-French and English-German datasets, we apply BPE (Sennrich, Haddow, and Birch 2016) to the combination of source and target text to obtain shared subword units. To simplify, we use the open-source grapheme-to-phoneme tool to map the transcription to the phoneme sequence.
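The paper names the Moses scripts, BPE (Sennrich, Haddow, and Birch 2016), and an open-source grapheme-to-phoneme tool but pins no versions, hence the "No" above. The sketch below is a minimal stand-in pipeline assuming the sacremoses, subword-nmt, and g2p_en Python packages; the BPE merge count and file names are placeholders, not values from the paper.

```python
# Minimal preprocessing sketch, assuming sacremoses, subword-nmt, and g2p_en
# as stand-ins for the Moses scripts, BPE, and G2P tool cited above.
from sacremoses import MosesPunctNormalizer, MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE
from g2p_en import G2p


def preprocess_target(text: str, lang: str = "fr") -> str:
    """Lowercase, normalize punctuation, and tokenize target-side text."""
    text = MosesPunctNormalizer(lang=lang).normalize(text.lower())
    return MosesTokenizer(lang=lang).tokenize(text, return_str=True)


def transcription_to_phonemes(transcription: str) -> list:
    """Map an English transcription to a phoneme sequence (AS-phase target)."""
    return G2p()(transcription)


# Shared BPE learned on the concatenation of source and target text.
# NUM_MERGES and the file names are placeholders, not values from the paper.
NUM_MERGES = 8000
with open("train.en-fr.joint.txt") as infile, open("bpe.codes", "w") as codes:
    learn_bpe(infile, codes, NUM_MERGES)
with open("bpe.codes") as codes:
    bpe = BPE(codes)

print(bpe.process_line(preprocess_target("Il a acheté un livre.")))
print(transcription_to_phonemes("he bought a book"))
```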
Experiment Setup: Yes
LLM Response: The number of transformer blocks is set to 12 and 6 for the acoustic-semantic (AS) phase and the transcription-translation (TT) phase, respectively. ... SpecAugment strategy ... with frequency masking (F = 30, m_F = 2) and time masking (T = 40, m_T = 2). ... α in Equation 10 is set to 0.5 for all datasets. ... The maximum decoding length is set to 500 for our models with consecutive decoding and 250 for other methods on all datasets.
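The quoted hyper-parameters can be summarized as a small configuration object. The sketch below is a hedged summary only: the field names are invented for illustration, and since Equation 10 is not quoted in this report, the loss combination shown is an assumed convex interpolation with weight α, not necessarily the paper's exact formula.

```python
# Hedged summary of the reported hyper-parameters; field names are illustrative.
from dataclasses import dataclass


@dataclass
class COSTTConfig:
    as_blocks: int = 12          # transformer blocks, acoustic-semantic phase
    tt_blocks: int = 6           # transformer blocks, transcription-translation phase
    alpha: float = 0.5           # weight in Equation 10 (exact form not quoted here)
    max_decode_len: int = 500    # 250 for baselines without consecutive decoding
    freq_mask_width: int = 30    # SpecAugment F
    freq_mask_num: int = 2       # SpecAugment m_F
    time_mask_width: int = 40    # SpecAugment T
    time_mask_num: int = 2       # SpecAugment m_T
    max_train_steps: int = 400_000
    num_gpus: int = 1            # single NVIDIA V100


def combined_loss(loss_as: float, loss_tt: float, alpha: float = 0.5) -> float:
    """Assumed convex combination of the AS- and TT-phase losses."""
    return alpha * loss_as + (1.0 - alpha) * loss_tt
```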