Multilingual Neural Machine Translation with Knowledge Distillation
Authors: Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on IWSLT, WMT and Ted talk translation datasets demonstrate the effectiveness of our method. Particularly, we show that one model is enough to handle multiple languages (up to 44 languages in our experiment), with comparable or even better accuracy than individual models. |
| Researcher Affiliation | Collaboration | Microsoft Research Asia ({xuta,taoqin,tyliu}@microsoft.com); Zhejiang University ({rayeren,zhaozhou}@zju.edu.cn); Key Laboratory of Machine Perception, MOE, School of EECS, Peking University (di_he@pku.edu.cn) |
| Pseudocode | Yes | Algorithm 1: Knowledge Distillation for Multilingual NMT (a hedged loss-function sketch for this algorithm follows the table) |
| Open Source Code | No | Our codes are implemented based on fairseq and we will release the codes once the paper is published. |
| Open Datasets | Yes | We use three datasets in our experiment. IWSLT: We collect 12 languages↔English translation pairs from the IWSLT evaluation campaign from year 2014 to 2016. WMT: We collect 6 languages↔English translation pairs from the WMT translation task. Ted Talk: We use the common corpus of TED talk which contains translations between multiple languages (Ye et al., 2018). |
| Dataset Splits | Yes | We use the official validation and test sets for each language pair. |
| Hardware Specification | Yes | We train the individual models with 4 NVIDIA Tesla V100 GPU cards and multilingual models with 8 of them. |
| Software Dependencies | No | The code is implemented based on fairseq and translation quality is evaluated by tokenized case-sensitive BLEU with multi-bleu.pl, but specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | For the IWSLT and Ted talk tasks, the model hidden size d_model, feed-forward hidden size d_ff and number of layers are 256, 1024 and 2, while for the WMT task the three parameters are 512, 2048 and 6 respectively. The mini-batch size is set to roughly 8192 tokens. For the individual models we use 0.2 dropout, while for multilingual models we use 0.1 dropout according to the validation performance. For knowledge distillation, we set T_check = 3000 steps (nearly two training epochs), the accuracy threshold τ = 1 BLEU score, the distillation coefficient λ = 0.5 and the number of the teacher's outputs K = 8 according to the validation performance. During inference, we decode with beam search and set beam size to 4 and length penalty α = 1.0 for all the languages. (These values are collected in a configuration sketch after the table.) |
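
The Algorithm 1 row above refers to the paper's selective knowledge distillation for multilingual NMT: the multilingual student is trained with the usual NLL loss plus a distillation term against per-language individual teachers, and the distillation term is switched off for a language once the student surpasses its teacher on the validation set. The sketch below is a minimal, hedged reconstruction of that per-batch loss, not the authors' fairseq implementation; the function names, tensor shapes and the exact way the two terms are combined are assumptions, using the hyperparameters quoted above (λ = 0.5, top-K = 8, τ = 1 BLEU).

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      pad_idx: int,
                      lam: float = 0.5,
                      top_k: int = 8) -> torch.Tensor:
    """Sketch of the combined NLL + top-K distillation loss.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    target:                         (batch, seq_len) reference token ids
    """
    # Standard NLL loss against the reference translation.
    nll = F.cross_entropy(student_logits.transpose(1, 2), target,
                          ignore_index=pad_idx)

    # Keep only the teacher's top-K probabilities per position and renormalize,
    # so the student only has to match a truncated teacher distribution.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    topk_probs, topk_idx = teacher_probs.topk(top_k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    student_log_probs = F.log_softmax(student_logits, dim=-1)
    topk_log_probs = student_log_probs.gather(-1, topk_idx)

    # Average the cross-entropy to the teacher over non-padding positions.
    mask = (target != pad_idx).unsqueeze(-1).float()
    kd = -(topk_probs * topk_log_probs * mask).sum() / mask.sum()

    # Combine the two terms with coefficient lam (0.5 in the paper).
    return (1.0 - lam) * nll + lam * kd


def use_distillation(teacher_bleu: float, student_bleu: float,
                     tau: float = 1.0) -> bool:
    """Checked every T_check steps on the validation set: keep distilling a
    language only while the student has not surpassed its teacher by more
    than tau BLEU (the exact comparison in Algorithm 1 may differ)."""
    return student_bleu < teacher_bleu + tau
```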
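
For quick reference, the hyperparameters reported in the Experiment Setup row are collected below as plain Python dictionaries. The key names are shorthand of my own and do not correspond to any particular framework's configuration schema.

```python
# Model sizes reported per task (hidden size, feed-forward size, number of layers).
IWSLT_TED_MODEL = dict(d_model=256, d_ff=1024, num_layers=2)
WMT_MODEL = dict(d_model=512, d_ff=2048, num_layers=6)

# Training settings.
TRAINING = dict(
    max_tokens_per_batch=8192,   # "roughly 8192 tokens" per mini-batch
    dropout_individual=0.2,      # individual (one teacher per language) models
    dropout_multilingual=0.1,    # multilingual student model
)

# Knowledge-distillation schedule and weights.
DISTILLATION = dict(
    T_check=3000,  # steps between teacher/student validation checks (~2 epochs)
    tau=1.0,       # accuracy threshold in BLEU for switching distillation off/on
    lam=0.5,       # distillation coefficient λ
    top_k=8,       # number of teacher outputs K kept per position
)

# Inference settings.
INFERENCE = dict(beam_size=4, length_penalty=1.0)
```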