Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
Authors: Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, Ming Zhou (pp. 9161-9168)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our model leads to significant improvements in En-De and En-Fr translation irrespective of the backbones. We conduct comprehensive experiments on the IWSLT18 speech translation benchmark (Jan et al. 2018), demonstrating the effectiveness of each component. Our model can lead to significant improvements for both the LSTM and Transformer backbones. |
| Researcher Affiliation | Collaboration | Chengyi Wang,¹ Yu Wu,² Shujie Liu,² Zhenglu Yang,¹ Ming Zhou² (¹Nankai University, Tianjin, China; ²Microsoft Research Asia, Beijing, China) |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any statements about open-sourcing its code or provide a link to a code repository for its methodology. |
| Open Datasets | Yes | We conduct experiments on the Speech Translation TED (ST-TED) En-De corpus (Jan et al. 2018) and the augmented Librispeech En-Fr corpus (Kocabiyikoglu, Besacier, and Kraif 2018). ... Aside from ST-TED, we use the TED-LIUM2 corpus (Rousseau, Deléglise, and Estève 2014) with 207h of speech data for ASR pre-training. |
| Dataset Splits | Yes | We split 2k segments from the ST-TED corpus as dev set and tst2010, tst2013, tst2014, tst2015 are used as test sets. The dev set is used as validation set and we report results on the test set. |
| Hardware Specification | Yes | All the models are trained on 4 Tesla P40 GPU for a maximum of 20 epochs. |
| Software Dependencies | No | All our models are implemented based on ESPnet (Watanabe et al. 2018). No specific versions of ESPnet or other software dependencies are stated. |
| Experiment Setup | Yes | For LSTM based models, we use a dropout of 0.3 for embedding and encoders. The model is trained using Adadelta with an initial learning rate of 1.0. For Transformer based models, we use a dropout rate of 0.1 and a gradient clip of 5.0. Following Dong, Xu, and Xu (2018), we use the Adam optimizer with the learning rate schedule $lrate = k \cdot d_{model}^{-0.5} \cdot \min(n^{-0.5},\, n \cdot warmup\_n^{-1.5})$. We set $k = 10$ and $warmup\_n = 25000$ in our experiments. ... For training of TCEN, we set $\alpha_{asr} = 0.2$ and $\alpha_{mt} = 0.8$ in the pre-training stage... For fine-tuning, we use $\alpha_{st} = 0.6$, $\alpha_{asr} = 0.2$ and $\alpha_{mt} = 0.2$. At inference time, we use a beam size of 10 and a length normalization weight of 0.2. |
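
For illustration, here is a minimal Python sketch of the reported Transformer learning-rate schedule and the fine-tuning multi-task loss weights quoted above. It is not the authors' code or ESPnet's API: the function names, the `d_model = 256` default, and the standalone loss combination are assumptions made for this sketch; only the formula, `k = 10`, `warmup_n = 25000`, and the alpha values come from the paper.

```python
def noam_lrate(step: int, d_model: int = 256, k: float = 10.0,
               warmup: int = 25000) -> float:
    """Learning rate at training step `step` (1-indexed):
    lrate = k * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def finetune_loss(loss_st: float, loss_asr: float, loss_mt: float,
                  alpha_st: float = 0.6, alpha_asr: float = 0.2,
                  alpha_mt: float = 0.2) -> float:
    """Weighted multi-task objective using the fine-tuning weights reported
    in the paper (alpha_st = 0.6, alpha_asr = 0.2, alpha_mt = 0.2)."""
    return alpha_st * loss_st + alpha_asr * loss_asr + alpha_mt * loss_mt


if __name__ == "__main__":
    # The rate rises roughly linearly during warmup, then decays as step^-0.5.
    for step in (1, 1000, 25000, 100000):
        print(f"step {step:>6}: lrate = {noam_lrate(step):.6f}")
```

With these settings the learning rate peaks around step 25000 and decays afterwards, which matches the schedule of Dong, Xu, and Xu (2018) that the paper cites.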