Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation

Authors: Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, Weihua Luo

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two public datasets show that our approach significantly outperforms strong pivot-based baseline and various multilingual NMT approaches.
Researcher Affiliation | Collaboration | Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, Weihua Luo; Institute of Artificial Intelligence, Soochow University, Suzhou, China; School of Computer Science and Technology, Soochow University, Suzhou, China; Alibaba DAMO Academy, Hangzhou, China; bjji@stu.suda.edu.cn, {xiangyuduan, minzhang}@suda.edu.cn, {zhirui.zzr, boxing.cbx, weihua.luowh}@alibaba-inc.com
Pseudocode | No | The paper describes algorithms (MLM, TLM, BRLM) in text and uses flowcharts such as Figure 2, but there is no explicit pseudocode or algorithm block.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We evaluate our cross-lingual pre-training based transfer approach against several strong baselines on two public datasets, Europarl (Koehn 2005) and MultiUN (Eisele and Chen 2010), which contain multi-parallel evaluation data to assess the zero-shot performance.
Dataset Splits | Yes | We use the devtest2006 as the validation set and the test2006 as the test set for Fr-Es and De-Fr. For the distant language pair Ro-De, we extract 1,000 overlapping sentences from newstest2016 as the test set and 2,000 overlapping sentences split from the training set as the validation set, since there are no official validation and test sets. (A hedged sketch of this split procedure follows the table.)
Hardware Specification | No | The paper discusses model architecture and training parameters, but does not specify any particular hardware components such as GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions several software components such as the "Transformer-big model", the "Adam optimizer", the "multi-bleu.perl" and "tokenizer.perl" scripts, "Facebook's cross-lingual pretrained models released by XLM", and the "fast_align" tool, but does not provide specific version numbers for any of them (the '3' attached to XLM in the setup text is a footnote marker, not a version number).
Experiment Setup | Yes | For a fair comparison, the Transformer-big model with 1024 embedding/hidden units, 4096 feed-forward filter size, 6 layers and 8 heads per layer is adopted for all translation models in our experiments. We set the batch size to 2400 and limit sentence length to 100 BPE tokens. We set attn_drop = 0 (the dropout rate on each attention head), which is favorable to zero-shot translation and has no effect on supervised translation directions (Gu et al. 2019). For model initialization, we use Facebook's cross-lingual pretrained models released by XLM to initialize the encoder part, and the remaining parameters are initialized with Xavier uniform. We employ the Adam optimizer with lr = 0.0001, t_warmup = 4000 and dropout = 0.1. At decoding time, we generate greedily with length penalty α = 1.0. Regarding MLM, TLM and BRLM, as mentioned in the pre-training phase of the transfer protocol, we first pre-train MLM on monolingual data of both the source and pivot languages, then use the MLM parameters to initialize TLM and the proposed BRLM, which are further optimized on source-pivot bilingual data. In our experiments, we use MLM+TLM and MLM+BRLM to denote this training process. For the masking strategy during training, following Devlin et al. (2018), 15% of BPE tokens are selected to be masked. Among the selected tokens, 80% are replaced with the [MASK] token, 10% are replaced with a random BPE token, and 10% are left unchanged.
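
The Dataset Splits row above describes building the Ro-De zero-shot validation and test sets from "overlapping" multi-parallel sentences. Below is a minimal sketch of one way that selection could be scripted; the file names, the one-sentence-per-line format, and the use of exact string matching on a shared English pivot side to define "overlap" are all assumptions for illustration, not details given in the paper.

```python
# Hypothetical sketch: pair Ro and De sentences whose English pivot sides match.
# All paths and the exact-match rule are assumptions, not the authors' procedure.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

def paired_by_pivot(pivot_a, side_a, pivot_b, side_b, limit):
    """Return up to `limit` (side_a, side_b) pairs whose pivot sentences match exactly."""
    b_by_pivot = dict(zip(pivot_b, side_b))
    pairs = [(a, b_by_pivot[p]) for p, a in zip(pivot_a, side_a) if p in b_by_pivot]
    return pairs[:limit]

# e.g. take 1,000 overlapping newstest2016 sentences as the Ro-De test set
ro_en_pivot = read_lines("newstest2016.roen.en")   # hypothetical file names
ro_side     = read_lines("newstest2016.roen.ro")
de_en_pivot = read_lines("newstest2016.deen.en")
de_side     = read_lines("newstest2016.deen.de")
test_pairs  = paired_by_pivot(ro_en_pivot, ro_side, de_en_pivot, de_side, limit=1000)
```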
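
The Experiment Setup row lists the Transformer-big hyperparameters and the Devlin et al. (2018) masking scheme. The snippet below restates those numbers as a plain configuration dictionary and sketches the 15% selection with the 80/10/10 corruption rule; the dictionary keys, function name, and token-level implementation are illustrative assumptions rather than the authors' code.

```python
import random

# Hyperparameters as reported in the Experiment Setup row; key names are illustrative.
TRANSFORMER_BIG = {
    "embed_dim": 1024,
    "ffn_dim": 4096,
    "layers": 6,
    "heads": 8,
    "attn_dropout": 0.0,    # attn_drop = 0, favorable to zero-shot translation
    "dropout": 0.1,
    "batch_size": 2400,
    "max_len_bpe": 100,
    "optimizer": "adam",
    "lr": 1e-4,
    "t_warmup": 4000,
    "length_penalty": 1.0,  # greedy decoding with alpha = 1.0
}

MASK_PROB = 0.15
MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, rng=random):
    """Apply the 80/10/10 corruption from Devlin et al. (2018) to a list of BPE tokens.

    Returns (corrupted, targets), where targets holds the original token at masked
    positions and None elsewhere (only masked positions are predicted)."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:           # 15% of tokens are selected
            targets.append(tok)
            r = rng.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                corrupted.append(MASK_TOKEN)
            elif r < 0.9:                      # 10%: replace with a random BPE token
                corrupted.append(rng.choice(vocab))
            else:                              # 10%: keep the original token
                corrupted.append(tok)
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets
```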