Multilingual Neural Machine Translation with Knowledge Distillation

Authors: Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on IWSLT, WMT and Ted talk translation datasets demonstrate the effectiveness of our method. Particularly, we show that one model is enough to handle multiple languages (up to 44 languages in our experiment), with comparable or even better accuracy than individual models."
Researcher Affiliation | Collaboration | "1 Microsoft Research Asia, {xuta,taoqin,tyliu}@microsoft.com; 2 Zhejiang University, {rayeren,zhaozhou}@zju.edu.cn; 3 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, di_he@pku.edu.cn"
Pseudocode | Yes | "Algorithm 1 Knowledge Distillation for Multilingual NMT"
Open Source Code | No | "Our codes are implemented based on fairseq and we will release the codes once the paper is published."
Open Datasets | Yes | "We use three datasets in our experiment. IWSLT: We collect 12 languages↔English translation pairs from IWSLT evaluation campaign from year 2014 to 2016. WMT: We collect 6 languages↔English translation pairs from WMT translation task. Ted Talk: We use the common corpus of TED talk which contains translations between multiple languages (Ye et al., 2018)."
Dataset Splits | Yes | "We use the official validation and test sets for each language pair."
Hardware Specification | Yes | "We train the individual models with 4 NVIDIA Tesla V100 GPU cards and multilingual models with 8 of them."
Software Dependencies | No | The code is implemented based on fairseq and translation quality is evaluated by tokenized case-sensitive BLEU with multi-bleu.pl, but specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | "For IWSLT and Ted talk tasks, the model hidden size d_model, feed-forward hidden size d_ff, and the number of layers are 256, 1024 and 2, while for WMT task, the three parameters are 512, 2048 and 6 respectively. The mini-batch size is set to roughly 8192 tokens. For the individual models, we use 0.2 dropout, while for multilingual models, we use 0.1 dropout according to the validation performance. For knowledge distillation, we set T_check = 3000 steps (nearly two training epochs), the accuracy threshold τ = 1 BLEU score, the distillation coefficient λ = 0.5 and the number of teacher's outputs K = 8 according to the validation performance. During inference, we decode with beam search and set beam size to 4 and length penalty α = 1.0 for all the languages."
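
To make the reported distillation hyperparameters concrete (λ = 0.5, top-K = 8 teacher outputs, T_check = 3000 steps, accuracy threshold τ = 1 BLEU), the following is a minimal PyTorch-style sketch of how a word-level distillation loss of this kind could be assembled. It is an illustration under assumptions, not the authors' unreleased fairseq code: the tensor shapes, the distillation_loss function, the padding index and the (1 − λ)·NLL + λ·KD combination are hypothetical stand-ins for what the paper's Algorithm 1 specifies.

    import torch
    import torch.nn.functional as F

    # Hypothetical values taken from the reported setup; shapes and the helper
    # below are illustrative assumptions, not the authors' released code.
    LAMBDA = 0.5   # distillation coefficient lambda
    TOP_K = 8      # number of teacher outputs kept per target position

    def distillation_loss(student_logits, teacher_logits, targets, pad_idx=1):
        """student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq)."""
        # Standard NLL term against the reference translations.
        nll = F.cross_entropy(student_logits.transpose(1, 2), targets,
                              ignore_index=pad_idx)

        # Keep only the teacher's top-K probabilities per position and renormalize,
        # which avoids matching the full vocabulary distribution.
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        topk_probs, topk_idx = teacher_probs.topk(TOP_K, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Cross-entropy between the truncated teacher distribution and the student.
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        topk_log_probs = student_log_probs.gather(-1, topk_idx)
        token_mask = (targets != pad_idx).unsqueeze(-1).float()
        distill = -(topk_probs * topk_log_probs * token_mask).sum() / token_mask.sum()

        # Combine the two terms; with lambda = 0.5 they are weighted equally.
        return (1.0 - LAMBDA) * nll + LAMBDA * distill

The reported T_check and τ govern the selective schedule of Algorithm 1 rather than the loss itself: roughly every T_check training steps the multilingual student is scored on each language's validation set, and the distillation term for a language is switched off or re-enabled depending on how its BLEU compares with the corresponding individual teacher within the τ margin; the exact toggle rule is given in the paper.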