Multilingual Neural Machine Translation with Knowledge Distillation

Authors: Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on IWSLT, WMT and Ted talk translation datasets demonstrate the effectiveness of our method. Particularly, we show that one model is enough to handle multiple languages (up to 44 languages in our experiment), with comparable or even better accuracy than individual models."
Researcher Affiliation | Collaboration | "1 Microsoft Research Asia, {xuta,taoqin,tyliu}@microsoft.com; 2 Zhejiang University, {rayeren,zhaozhou}@zju.edu.cn; 3 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, di_he@pku.edu.cn"
Pseudocode | Yes | "Algorithm 1 Knowledge Distillation for Multilingual NMT"
Open Source Code | No | "Our codes are implemented based on fairseq and we will release the codes once the paper is published."
Open Datasets | Yes | "We use three datasets in our experiment. IWSLT: We collect 12 languages↔English translation pairs from IWSLT evaluation campaign from year 2014 to 2016. WMT: We collect 6 languages↔English translation pairs from WMT translation task. Ted Talk: We use the common corpus of TED talk which contains translations between multiple languages (Ye et al., 2018)."
Dataset Splits | Yes | "We use the official validation and test sets for each language pair."
Hardware Specification | Yes | "We train the individual models with 4 NVIDIA Tesla V100 GPU cards and multilingual models with 8 of them."
Software Dependencies | No | The code is implemented based on fairseq and translation quality is evaluated by tokenized case-sensitive BLEU with multi-bleu.pl, but specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | "For IWSLT and Ted talk tasks, the model hidden size d_model, feed-forward hidden size d_ff, and the number of layers are 256, 1024 and 2, while for WMT task, the three parameters are 512, 2048 and 6 respectively. The mini-batch size is set to roughly 8192 tokens. For the individual models, we use 0.2 dropout, while for multilingual models, we use 0.1 dropout according to the validation performance. For knowledge distillation, we set T_check = 3000 steps (nearly two training epochs), the accuracy threshold τ = 1 BLEU score, the distillation coefficient λ = 0.5 and the number of teacher's outputs K = 8 according to the validation performance. During inference, we decode with beam search and set beam size to 4 and length penalty α = 1.0 for all the languages."
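
To make the reported distillation hyperparameters concrete (λ = 0.5, top-K = 8 teacher outputs, T_check = 3000 steps, accuracy threshold τ = 1 BLEU), the following is a minimal PyTorch-style sketch of how a word-level distillation loss of this kind could be assembled. It is an illustration under assumptions, not the authors' unreleased fairseq code: the tensor shapes, the distillation_loss function, the padding index and the (1 − λ)·NLL + λ·KD combination are hypothetical stand-ins for what the paper's Algorithm 1 specifies.

    import torch
    import torch.nn.functional as F

    # Hypothetical values taken from the reported setup; shapes and the helper
    # below are illustrative assumptions, not the authors' released code.
    LAMBDA = 0.5   # distillation coefficient lambda
    TOP_K = 8      # number of teacher outputs kept per target position

    def distillation_loss(student_logits, teacher_logits, targets, pad_idx=1):
        """student_logits, teacher_logits: (batch, seq, vocab); targets: (batch, seq)."""
        # Standard NLL term against the reference translations.
        nll = F.cross_entropy(student_logits.transpose(1, 2), targets,
                              ignore_index=pad_idx)

        # Keep only the teacher's top-K probabilities per position and renormalize,
        # which avoids matching the full vocabulary distribution.
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        topk_probs, topk_idx = teacher_probs.topk(TOP_K, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Cross-entropy between the truncated teacher distribution and the student.
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        topk_log_probs = student_log_probs.gather(-1, topk_idx)
        token_mask = (targets != pad_idx).unsqueeze(-1).float()
        distill = -(topk_probs * topk_log_probs * token_mask).sum() / token_mask.sum()

        # Combine the two terms; with lambda = 0.5 they are weighted equally.
        return (1.0 - LAMBDA) * nll + LAMBDA * distill

The reported T_check and τ govern the selective schedule of Algorithm 1 rather than the loss itself: roughly every T_check training steps the multilingual student is scored on each language's validation set, and the distillation term for a language is switched off or re-enabled depending on how its BLEU compares with the corresponding individual teacher within the τ margin; the exact toggle rule is given in the paper.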