Neural Machine Translation with Soft Prototype
Authors: Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao Qin, ChengXiang Zhai, Tie-Yan Liu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies on various neural machine translation tasks show that our approach brings substantial improvement in generation quality over the baseline model, with little extra cost in storage and inference time, demonstrating the effectiveness of our proposed framework. Specifically, we achieve state-of-the-art results on WMT2014, 2015 and 2017 English→German translation. |
| Researcher Affiliation | Collaboration | Yiren Wang (1), Yingce Xia (2), Fei Tian (3), Fei Gao (4), Tao Qin (2), ChengXiang Zhai (1), Tie-Yan Liu (2). Affiliations: (1) University of Illinois at Urbana-Champaign, (2) Microsoft Research, (3) Facebook, (4) Institute of Computing Technology, Chinese Academy of Sciences. Emails: {yiren, czhai}@illinois.edu; {yingce.xia, taoqin, tie-yan.liu}@microsoft.com; feitia@fb.com; gaofei17n@ict.ac.cn |
| Pseudocode | No | The source codes are included in the supplementary documents and details can be found at transformer_softproto.py. The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source codes are included in the supplementary documents and details can be found at transformer_softproto.py. |
| Open Datasets | Yes | Datasets: We experiment with two large scale and widely adopted benchmark datasets, the WMT2014 English to German news translation (En→De) and WMT2014 English to French news translation (En→Fr). We use 4.5M bilingual sentence pairs as the training data for En→De and 36M pairs for En→Fr. We use 50M monolingual sentences for each language from Newscrawl 2007-2017 as training data following [10]. |
| Dataset Splits | Yes | We use 4.5M bilingual sentence pairs as the training data for En→De and 36M pairs for En→Fr. We use the concatenation of Newstest2012 and Newstest2013 as the validation set (6003 sentences) and Newstest2014 as the test set (3003 sentences). |
| Hardware Specification | Yes | The models are trained on 8 M40 GPUs for 10 days for En→De and 21 days for En→Fr. We train our models on 8 M40 GPUs with 4.5M bitext from WMT2014 En→De for another 1.5 days. |
| Software Dependencies | No | We use Adam [7] with the same learning rate scheduler used in [16] for optimization, and use multi-bleu to evaluate the quality of translation. All models are evaluated on various test sets (Newstest2014-2018) with sacreBLEU. No specific version numbers are provided for the software. |
| Experiment Setup | Yes | We use the transformer_big setting following [16], with a 6-layer encoder and 6-layer decoder. The dimensions of word embeddings, hidden states and the filter sizes are 1024, 1024 and 4096, respectively. The dropout is 0.3 for En→De and 0.1 for En→Fr. The models are trained on 8 M40 GPUs for 10 days for En→De and 21 days for En→Fr. κ is fixed as 3 across all tasks. We use beam size 4 and length penalty 0.6 for inference, and use multi-bleu to evaluate the quality of translation. We use Adam [7] with the same learning rate scheduler used in [16] for optimization. The translations are generated with beam size of 5 and length penalty 1.0. |
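
The Experiment Setup and Software Dependencies rows above pin down most of the reported training hyperparameters. As a quick reference, here is a minimal Python sketch that collects those reported values together with the inverse-square-root warmup learning-rate schedule of [16] (Vaswani et al.), which the paper reuses. The `warmup_steps` value is not stated in the excerpts above; 4000 is only a common-default placeholder, and the dictionary/function names are illustrative rather than taken from `transformer_softproto.py`.

```python
# Hedged summary sketch of the reported setup; values are copied from the
# "Experiment Setup" row above, except warmup_steps (assumed default).

TRANSFORMER_BIG_SETUP = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "embed_dim": 1024,        # word embedding dimension
    "hidden_dim": 1024,       # hidden state dimension
    "ffn_dim": 4096,          # filter (feed-forward) size
    "dropout": {"en_de": 0.3, "en_fr": 0.1},
    "kappa": 3,               # fixed across all tasks in the paper
    "beam_size": 4,           # main-result inference setting
    "length_penalty": 0.6,
}

def inverse_sqrt_lr(step: int, d_model: int = 1024, warmup_steps: int = 4000) -> float:
    """Learning-rate schedule from 'Attention Is All You Need' [16]:
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

if __name__ == "__main__":
    # Example: learning rate right after warmup ends.
    print(inverse_sqrt_lr(4000))
```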