Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement

Authors: Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen, Jun Xie, Tong Xu (pp. 10590-10598)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs, while achieving better performance in the automatic speech recognition task.
Researcher Affiliation | Collaboration | University of Science and Technology of China, Hefei, China; Machine Intelligence Technology Lab, Alibaba DAMO Academy; Rutgers University, New Brunswick, USA
Pseudocode | No | The paper describes the proposed methods and mathematical formulations, but it does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code is open-sourced at https://github.com/duyichao/E2E-ST-TDA.
Open Datasets | Yes | MuST-C (Gangi et al. 2019) is a publicly available large-scale multilingual speech-to-text translation corpus... We introduce the LibriSpeech dataset (Panayotov et al. 2015) as the external ASR data.
Dataset Splits | Yes | During inference, we average the model parameters of the 10 best checkpoints based on performance on the MuST-C dev set, and adopt a beam search strategy with a beam size of 5. (See the checkpoint-averaging sketch after the table.)
Hardware Specification | Yes | In practice, we train all models with 2 Nvidia Tesla V100 GPUs, and it takes 1-2 days to finish the whole training.
Software Dependencies | No | The paper states: "All experiments are implemented based on the FAIRSEQ (Ott et al. 2019) toolkit." However, it does not provide specific version numbers for FAIRSEQ or any other software dependencies.
Experiment Setup | Yes | We adopt a Transformer-based backbone for all models, consisting of 2 one-dimensional convolutional layers with a down-sampling factor of 4, 12 Transformer encoder layers, and 6 Transformer decoder layers. More specifically, for the small model, we set the size of the self-attention layer, the feed-forward network, and the number of heads to 256, 2048, and 4, respectively; for the medium model, these are set to 512, 2048, and 8, respectively. All models are initialized from the pre-trained ASR speech encoder to speed up convergence. During training, we use the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.002 and 10K warm-up updates. The label smoothing and dropout ratios are set to 0.1 and 0.3, respectively. The batch size on each GPU is set to 10000, and we accumulate gradients over every 4 batches. During inference, ... adopt a beam search strategy with a beam size of 5. (See the model and training sketch after the table.)
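
The Experiment Setup row pins down the backbone topology and the small/medium width settings. The snippet below is not the authors' released code; it is a minimal PyTorch sketch of the described architecture (two strided 1-D convolutions giving 4x temporal down-sampling, a 12-layer Transformer encoder, and a 6-layer decoder), with positional encodings, padding masks, and the causal decoder mask omitted for brevity. The class and argument names (SpeechTranslationBackbone, n_mels, vocab_size) are illustrative assumptions, not identifiers from the repository.

```python
import torch
import torch.nn as nn

# Width settings reported for the small and medium models.
CONFIGS = {
    "small":  dict(d_model=256, ffn=2048, heads=4),
    "medium": dict(d_model=512, ffn=2048, heads=8),
}

class SpeechTranslationBackbone(nn.Module):
    """Hypothetical sketch of the reported backbone, not the authors' code."""

    def __init__(self, n_mels=80, vocab_size=10000, size="small"):
        super().__init__()
        cfg = CONFIGS[size]
        d = cfg["d_model"]
        # Two strided 1-D convolutions -> overall temporal down-sampling of 4.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )
        # 12 encoder layers, 6 decoder layers, dropout 0.3 as reported.
        self.transformer = nn.Transformer(
            d_model=d, nhead=cfg["heads"],
            num_encoder_layers=12, num_decoder_layers=6,
            dim_feedforward=cfg["ffn"], dropout=0.3, batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, fbank, prev_tokens):
        # fbank: (batch, frames, n_mels); prev_tokens: (batch, tgt_len)
        x = self.subsample(fbank.transpose(1, 2)).transpose(1, 2)
        y = self.embed(prev_tokens)
        h = self.transformer(x, y)
        return self.out(h)
```

These widths match FAIRSEQ's stock s2t_transformer small/medium configurations, which the authors likely build on given that the experiments are implemented in FAIRSEQ.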
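
The Dataset Splits row also documents the inference recipe: averaging the parameters of the 10 best dev-set checkpoints and decoding with a beam size of 5. FAIRSEQ ships scripts/average_checkpoints.py for this step; the stand-alone sketch below shows the same averaging logic, assuming fairseq-style checkpoint files that store the weights under a "model" key (an assumption about the file layout, not a detail taken from the paper).

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints, e.g. the 10 best
    checkpoints selected on the MuST-C dev set. Assumes fairseq-style
    checkpoint files that keep the model weights under the "model" key."""
    n = len(paths)
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            # Accumulate floating-point tensors in float64 to limit rounding error.
            avg = {k: v.clone().double() if torch.is_floating_point(v) else v.clone()
                   for k, v in state.items()}
        else:
            for k, v in state.items():
                if torch.is_floating_point(v):
                    avg[k] = avg[k] + v.double()
    # Divide the accumulated sums by the number of checkpoints.
    return {k: (v / n).float() if torch.is_floating_point(v) else v
            for k, v in avg.items()}
```

The averaged state dict is then loaded back into the model before running beam-search decoding with the reported beam size of 5.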