Improving End-to-End Speech Translation by Leveraging Auxiliary Speech and Text Data

Authors: Yuhao Zhang, Chen Xu, Bojie Hu, Chunliang Zhang, Tong Xiao, Jingbo Zhu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments / Data: "We run our experiments on English to German (En-De) and English to French (En-Fr) translation tasks." ... Results: "Table 2 shows our experimental results." ... Analysis / Ablation Study: "We replace the adapters in the baseline system with our alignment adapter. Table 3 shows that the alignment adapter can achieve better performance."
Researcher Affiliation | Collaboration | Yuhao Zhang1, Chen Xu1, Bojie Hu3, Chunliang Zhang1,2, Tong Xiao1,2*, Jingbo Zhu1,2; 1 School of Computer Science and Engineering, Northeastern University, Shenyang, China; 2 NiuTrans Research, Shenyang, China; 3 Tencent Minority-Mandarin Translation, Beijing, China
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the code for the described methodology is open source, nor does it provide a link to a repository.
Open Datasets | Yes | "For speech data, we use the Librilight (Kahn et al. 2020) which consists of about 60k hours of unlabelled speech. ... We use LibriSpeech 960 hours (Panayotov et al. 2015) to train the pre-trained acoustic model on the English ASR task. To adapt the DAE model to the MT task, we use the Opensubtitle En-De and WMT14 En-Fr datasets respectively. ... The MuST-C En-De and En-Fr tasks (Di Gangi et al. 2019) ... For the LibriSpeech En-Fr task (Kocabiyikoglu, Besacier, and Kraif 2018)." (A hedged loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a validation set to stop training ("We stop training until the perplexity converges on the validation set.") and gives training-set sizes for specific tasks, but it does not state percentages or absolute counts for the training, validation, and test splits.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU or CPU models or cloud computing instance types, used for running the experiments.
Software Dependencies | No | The paper mentions the Fairseq toolkit, the wav2vec2 model, the mBART.CC25 model, and sentencepiece, but does not provide version numbers for these software components. (A version-capture sketch follows the table.)
Experiment Setup | Yes | "For pre-training of SIDAE, we set the coefficient r to 0.3. ... For the alignment adapter, the size of the convolutional layer n is set to 3. For each Conformer layer, there are 1,024 hidden states, 16 attention heads and 4,096 FFN hidden states. We freeze the pre-trained acoustic model in the first 5,000 training steps to warm up the two adapters. The τ and α are set to 0.1 and 0.3. The initial value of β is 1. It then decreases by 0.1 per 5,000 steps until 0. For fine-tuning on the ST task, we use the Adam optimizer with β1 = 0.9 and β2 = 0.98. We use dropout (p = 0.1) and label smoothing (p = 0.1) for robust training. We early stop the training if the last five checkpoints do not improve. For inference, the beam size is set to 4 and the length penalty is set to 1.0." (A hedged configuration sketch follows the table.)
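
As noted in the Open Datasets row, the cited corpora are public. Below is a minimal sketch of fetching one LibriSpeech training split with torchaudio's public loader; the paper does not release its data pipeline, so the loader choice, root path, and split name here are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch: downloading and reading one LibriSpeech split with
# torchaudio. This is NOT the authors' pipeline (which is not released);
# it only demonstrates that the cited corpus is publicly obtainable.
import torchaudio

ROOT = "./data"  # hypothetical download directory (assumption)

train_clean_100 = torchaudio.datasets.LIBRISPEECH(
    ROOT, url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript, speaker, chapter, utterance).
waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = train_clean_100[0]
print(sample_rate, waveform.shape, transcript[:60])
```

LibriSpeech 960h is the union of the train-clean-100, train-clean-360, and train-other-500 splits; MuST-C and the other cited corpora are distributed through their own channels and are not shown here.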
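Because the Software Dependencies row flags missing version numbers, the short sketch below shows one way such versions could be captured from a local environment at run time. The PyPI package names are assumptions; the paper only names the tools, not their distributions.

```python
# Hedged sketch: recording the dependency versions the paper omits.
# Package names are the common PyPI names (an assumption); the printed
# values depend entirely on the local environment.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("fairseq", "sentencepiece", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```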
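The Experiment Setup row quotes enough hyperparameters to restate the fine-tuning configuration in code. The sketch below is a plain PyTorch illustration of the reported values, not the authors' Fairseq implementation; the stand-in model and the step values probed in the asserts are assumptions.

```python
# Hedged sketch of the reported fine-tuning hyperparameters. The real
# system is built on Fairseq; this snippet only encodes the quoted
# values so they can be checked at a glance.
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the ST model (assumption)

# "Adam optimizer with beta1 = 0.9 and beta2 = 0.98"
optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98))

# "dropout (p = 0.1) and label smoothing (p = 0.1)"
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
DROPOUT_P = 0.1

FREEZE_STEPS = 5_000   # acoustic model frozen for the first 5,000 steps
TAU, ALPHA = 0.1, 0.3  # reported values of tau and alpha

def beta_at(step: int) -> float:
    """Reported schedule: beta starts at 1 and decreases by 0.1
    every 5,000 steps until it reaches 0."""
    return max(0.0, 1.0 - 0.1 * (step // 5_000))

assert beta_at(0) == 1.0
assert beta_at(5_000) == 0.9
assert beta_at(60_000) == 0.0
```

In Fairseq, the quoted decoding settings correspond to running fairseq-generate with --beam 4 --lenpen 1.0.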