Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement
Authors: Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen, Jun Xie, Tong Xu
AAAI 2022, pp. 10590-10598
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs, while achieving better performance in the automatic speech recognition task. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, Hefei, China; (2) Machine Intelligence Technology Lab, Alibaba DAMO Academy; (3) Rutgers University, New Brunswick, USA |
| Pseudocode | No | The paper describes the proposed methods and mathematical formulations, but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/duyichao/E2E-ST-TDA. |
| Open Datasets | Yes | MuST-C (Gangi et al. 2019) is a publicly available large-scale multilingual speech-to-text translation corpus... We introduce the LibriSpeech dataset (Panayotov et al. 2015) as the external ASR data. |
| Dataset Splits | Yes | During inference, we average the model parameters over the 10 best checkpoints, selected by performance on the MuST-C dev set, and adopt a beam search strategy with a beam size of 5. (A checkpoint-averaging sketch follows the table.) |
| Hardware Specification | Yes | In practice, we train all models with 2 Nvidia Tesla-V100 GPUs and it takes 1-2 days to finish the whole training. |
| Software Dependencies | No | The paper states: "All experiments are implemented based on the FAIRSEQ (Ott et al. 2019) toolkit." However, it does not provide specific version numbers for FAIRSEQ or any other software dependencies. |
| Experiment Setup | Yes | We adopt a Transformer-based backbone for all models, consisting of 2 one-dimensional convolutional layers with a down-sampling factor of 4, 12 Transformer encoder layers, and 6 Transformer decoder layers. More specifically, for the small model, we set the size of the self-attention layer, the feed-forward network, and the number of heads to 256, 2048, and 4, respectively; for the medium model, these parameters are set to 512, 2048, and 8. All models are initialized from the pre-trained ASR speech encoder to speed up convergence. During training, we use the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.002 and 10K warm-up updates. The label smoothing and dropout ratios are set to 0.1 and 0.3, respectively. The batch size per GPU is set to 10000, and we accumulate gradients over every 4 batches. During inference, ... adopt a beam search strategy with a beam size of 5. (A configuration sketch follows the table.) |
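
To make the reported training configuration easier to scan, the sketch below collects the architecture and optimization hyperparameters quoted above into a single Python mapping, together with an inverse-square-root warmup schedule of the kind FAIRSEQ commonly pairs with Adam. The schedule choice and all identifiers in the snippet are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the reported setup; the inverse-square-root schedule and all
# names below are assumptions, not taken from the E2E-ST-TDA repository.

TRAIN_CONFIG = {
    "conv_layers": 2,                             # 1-D convolutions, overall down-sampling factor 4
    "encoder_layers": 12,                         # Transformer encoder layers
    "decoder_layers": 6,                          # Transformer decoder layers
    "embed_dim": {"small": 256, "medium": 512},   # self-attention size
    "ffn_dim": 2048,                              # feed-forward network size
    "attention_heads": {"small": 4, "medium": 8},
    "optimizer": "adam",
    "peak_lr": 2e-3,
    "warmup_updates": 10_000,
    "label_smoothing": 0.1,
    "dropout": 0.3,
    "max_tokens_per_gpu": 10_000,                 # batch size per GPU
    "update_freq": 4,                             # gradient-accumulation steps
    "num_gpus": 2,                                # Nvidia Tesla V100
}

def inverse_sqrt_lr(step: int, peak_lr: float = 2e-3, warmup: int = 10_000) -> float:
    """Linear warmup to peak_lr, then inverse-square-root decay (assumed schedule)."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5
```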
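
The inference recipe (averaging the 10 best checkpoints selected on the MuST-C dev set, then decoding with beam size 5) can likewise be sketched. FAIRSEQ ships its own checkpoint-averaging script, so the function below only illustrates the idea; the checkpoint layout and file paths are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Element-wise average of the parameter tensors stored in several checkpoints."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # FAIRSEQ-style layout (assumed)
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Usage with hypothetical paths: average the 10 best dev-set checkpoints, then
# decode the averaged model with beam size 5 (decoding itself not shown here).
# averaged = average_checkpoints([f"checkpoints/checkpoint.best_{i}.pt" for i in range(10)])
# torch.save({"model": averaged}, "checkpoints/checkpoint_avg10.pt")
```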