Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
Authors: Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, Ming Zhou (pp. 9161-9168)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our model leads to significant improvements in En-De and En-Fr translation irrespective of the backbones. We conduct comprehensive experiments on the IWSLT18 speech translation benchmark (Jan et al. 2018), demonstrating the effectiveness of each component. Our model can lead to significant improvements for both the LSTM and Transformer backbones. |
| Researcher Affiliation | Collaboration | Chengyi Wang,¹ Yu Wu,² Shujie Liu,² Zhenglu Yang,¹ Ming Zhou² (¹Nankai University, Tianjin, China; ²Microsoft Research Asia, Beijing, China) |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any statements about open-sourcing its code or provide a link to a code repository for its methodology. |
| Open Datasets | Yes | We conduct experiments on the Speech Translation TED (ST-TED) En-De corpus (Jan et al. 2018) and the augmented Librispeech En-Fr corpus (Kocabiyikoglu, Besacier, and Kraif 2018). ... Aside from ST-TED, we use the TED-LIUM2 corpus (Rousseau, Deléglise, and Estève 2014) with 207h of speech data for ASR pre-training. |
| Dataset Splits | Yes | We split 2k segments from the ST-TED corpus as dev set and tst2010, tst2013, tst2014, tst2015 are used as test sets. The dev set is used as validation set and we report results on the test set. |
| Hardware Specification | Yes | All the models are trained on 4 Tesla P40 GPU for a maximum of 20 epochs. |
| Software Dependencies | No | All our models are implemented based on ESPnet (Watanabe et al. 2018). No specific versions of ESPnet or other software dependencies are stated. |
| Experiment Setup | Yes | For LSTM based models, we use a dropout of 0.3 for embedding and encoders. The model is trained using Adadelta with an initial learning rate of 1.0. For Transformer based models, we use a dropout rate of 0.1 and a gradient clip of 5.0. Following Dong, Xu, and Xu (2018), we use the Adam optimizer with the learning rate schedule $lrate = k \cdot d_{model}^{-0.5} \cdot \min(n^{-0.5},\, n \cdot warmup\_n^{-1.5})$. We set $k = 10$ and $warmup\_n = 25000$ in our experiments. ... For training of TCEN, we set $\alpha_{asr} = 0.2$ and $\alpha_{mt} = 0.8$ in the pre-training stage... For fine-tuning, we use $\alpha_{st} = 0.6$, $\alpha_{asr} = 0.2$ and $\alpha_{mt} = 0.2$. At inference time, we use a beam size of 10 and a length normalization weight of 0.2. |
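
For illustration, here is a minimal Python sketch of the reported Transformer learning-rate schedule and the fine-tuning multi-task loss weights quoted above. It is not the authors' code or ESPnet's API: the function names, the `d_model = 256` default, and the standalone loss combination are assumptions made for this sketch; only the formula, `k = 10`, `warmup_n = 25000`, and the alpha values come from the paper.

```python
def noam_lrate(step: int, d_model: int = 256, k: float = 10.0,
               warmup: int = 25000) -> float:
    """Learning rate at training step `step` (1-indexed):
    lrate = k * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def finetune_loss(loss_st: float, loss_asr: float, loss_mt: float,
                  alpha_st: float = 0.6, alpha_asr: float = 0.2,
                  alpha_mt: float = 0.2) -> float:
    """Weighted multi-task objective using the fine-tuning weights reported
    in the paper (alpha_st = 0.6, alpha_asr = 0.2, alpha_mt = 0.2)."""
    return alpha_st * loss_st + alpha_asr * loss_asr + alpha_mt * loss_mt


if __name__ == "__main__":
    # The rate rises roughly linearly during warmup, then decays as step^-0.5.
    for step in (1, 1000, 25000, 100000):
        print(f"step {step:>6}: lrate = {noam_lrate(step):.6f}")
```

With these settings the learning rate peaks around step 25000 and decays afterwards, which matches the schedule of Dong, Xu, and Xu (2018) that the paper cites.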