Transductive Ensemble Learning for Neural Machine Translation
Authors: Yiren Wang, Lijun Wu, Yingce Xia, Tao Qin, ChengXiang Zhai, Tie-Yan Liu
AAAI 2020, pp. 6291-6298
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on different settings (with/without monolingual data) and different language pairs (English↔{German, Finnish}). The results show that our approach boosts strong individual models with significant improvement and benefits a lot from more individual models. Specifically, we achieve the state-of-the-art performances on the WMT2016-2018 English→German translations. |
| Researcher Affiliation | Collaboration | 1 University of Illinois at Urbana-Champaign; 2 School of Data and Computer Science, Sun Yat-sen University; 3 Microsoft Research Asia. 1 {yiren, czhai}@illinois.edu, 2 wulijun3@mail2.sysu.edu.cn, 3 {Yingce.Xia, taoqin, tyliu}@microsoft.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions that "The experiments are based on the PyTorch implementation of Transformer" and provides a link to the fairseq GitHub repository (https://github.com/pytorch/fairseq). However, this is a third-party framework used for implementation, not the specific source code for the proposed Transductive Ensemble Learning (TEL) method by the authors. |
| Open Datasets | Yes | The majority of our empirical studies are conducted on the WMT2019 English→German (En→De) and German→English (De→En) news translation tasks. We use 5M bitext as our training data...We also experiment on another two more translation tasks, WMT2019 English→Finnish (En→Fi) and Finnish→English (Fi→En) news translations... |
| Dataset Splits | Yes | We use Newstest2015 as the validation set for model selection. |
| Hardware Specification | Yes | The models are trained on 8 M40 GPUs with a batch size of 4096. |
| Software Dependencies | No | The paper states, "The experiments are based on the PyTorch implementation of Transformer." However, it does not specify a version number for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | The dimensions of word embeddings, hidden states and non-linear layer are set as 1024, 1024 and 4096 respectively, and the number of heads for multi-head attention is set as 16. The dropout is 0.3 for both En↔De and En↔Fi. All models are optimized with Adam (Kingma and Ba 2015) following the optimizer settings and learning rate schedule in (Vaswani et al. 2017). The models are trained on 8 M40 GPUs with a batch size of 4096. |
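
To make the reported hyperparameters concrete, the sketch below reproduces the described Transformer-big configuration in plain PyTorch rather than the authors' fairseq setup. The layer count, warmup steps, and base learning rate are not stated in the quoted setup; they are assumptions taken from the Vaswani et al. (2017) defaults that the paper says it follows.

```python
# Minimal sketch of the reported Transformer-big setup, assuming Vaswani et al.
# (2017) defaults for values the paper does not state (layer count, warmup).
import torch
import torch.nn as nn

D_MODEL = 1024        # word embedding / hidden state dimension (stated)
FFN_DIM = 4096        # non-linear (feed-forward) layer dimension (stated)
NUM_HEADS = 16        # multi-head attention heads (stated)
DROPOUT = 0.3         # dropout for En<->De and En<->Fi (stated)
NUM_LAYERS = 6        # assumed: standard Transformer-big depth
WARMUP_STEPS = 4000   # assumed: Vaswani et al. (2017) default warmup

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_LAYERS,
    num_decoder_layers=NUM_LAYERS,
    dim_feedforward=FFN_DIM,
    dropout=DROPOUT,
)

# Adam with the original Transformer settings (betas=(0.9, 0.98), eps=1e-9)
# and the inverse-square-root warmup learning rate schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def inverse_sqrt_with_warmup(step: int) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return (D_MODEL ** -0.5) * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt_with_warmup)
```

With a base learning rate of 1.0, the `LambdaLR` multiplier equals the effective learning rate, matching the inverse-square-root schedule; the sketch only mirrors the quoted dimensions, heads, dropout, and optimizer, not the authors' full training pipeline or the batch size of 4096 per GPU.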