Mirror-Generative Neural Machine Translation
Authors: Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, Jiajun Chen
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the proposed MGNMT consistently outperforms existing approaches in a variety of language pairs and scenarios, including resource-rich and low-resource situations. |
| Researcher Affiliation | Collaboration | ¹ National Key Laboratory for Novel Software Technology, Nanjing University (zhengzx@smail.nju.edu.cn, {huangsj,daixinyu,chenjj}@nju.edu.cn); ² ByteDance AI Lab ({zhouhao.nlp,lileilab}@bytedance.com) |
| Pseudocode | Yes | Algorithm 1: Training MGNMT from Non-Parallel Data; Algorithm 2: MGNMT Decoding with EM Algorithm (a hedged sketch of the decoding loop appears after this table) |
| Open Source Code | No | The paper does not contain any explicit statement or link providing concrete access to the source code for the proposed MGNMT methodology. |
| Open Datasets | Yes | Dataset: To evaluate our model in resource-poor scenarios, we conducted experiments on the WMT16 English-to/from-Romanian (WMT16 EN↔RO) translation task... As for resource-rich scenarios, we conducted experiments on the WMT14 English-to/from-German (WMT14 EN↔DE) and NIST English-to/from-Chinese (NIST EN↔ZH) translation tasks. For all the languages, we use the non-parallel data from News Crawl, except for NIST EN↔ZH, where the Chinese monolingual data were extracted from the LDC corpus. |
| Dataset Splits | Yes | Dev/test sets from the Table 1 caption: newstest2013/newstest2014, MT06/MT03, newstest2015/newstest2016, and tst13/tst14 & newstest2014. Also, Table 2 lists our best setting of KL-annealing for each task on the development sets. |
| Hardware Specification | Yes | We trained our models on a single GTX 1080ti GPU. |
| Software Dependencies | No | We implemented our models on top of Transformer (Vaswani et al., 2017), RNMT (Bahdanau et al., 2015), and GNMT (Shah & Barber, 2018) in PyTorch (footnote 3). The footnote mentions PyTorch, but without a version. |
| Experiment Setup | Yes | For all language pairs, sentences were encoded using byte pair encoding (Sennrich et al., 2016a, BPE) with 32k merge operations... We used the Adam optimizer (Kingma & Ba, 2014) with the same learning rate schedule strategy as Vaswani et al. (2017) with 4k warmup steps. Each mini-batch consists of about 4,096 source and target tokens respectively... For all experiments, word dropout rates were set to a constant of 0.3. Honestly, annealing KL weight is somewhat tricky. Table 2 lists our best setting of KL-annealing for each task on the development sets. (A sketch of these schedule details appears after this table.) |
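
The pseudocode row above refers to Algorithm 2, which decodes with an EM-style iterative refinement. Since the paper's code is not released, the following is only a minimal sketch of that loop under our own assumptions: every callable here (`infer_latent`, `translate`, `score_forward`, `score_backward`) is a hypothetical stand-in for the mirror components described in the paper, not an actual API.

```python
# Hypothetical sketch of an EM-style iterative decoding loop in the spirit of
# MGNMT's Algorithm 2. None of these callables come from the paper's code; they
# are placeholders for the mirror components:
#   infer_latent(x, y)      -> latent z from the approximate posterior q(z | x, y)
#   translate(x, z)         -> candidate target sentences from p(y | x, z)
#   score_forward(x, y, z)  -> log p(y | x, z) + log p(y | z)   (target-side scores)
#   score_backward(x, y, z) -> log p(x | y, z) + log p(x | z)   (source-side scores)

from typing import Callable, List


def em_decode(
    x: str,
    infer_latent: Callable[[str, str], object],
    translate: Callable[[str, object], List[str]],
    score_forward: Callable[[str, str, object], float],
    score_backward: Callable[[str, str, object], float],
    num_iterations: int = 3,
) -> str:
    """Iteratively refine a draft translation, rescoring candidates with both
    translation directions, as named in the paper's Algorithm 2."""
    # Seed: draft translation with an uninformative latent (here: None).
    y = translate(x, None)[0]
    for _ in range(num_iterations):
        # E-step: re-infer the shared latent from the current (x, y) pair.
        z = infer_latent(x, y)
        # M-step: re-decode and keep the candidate that both directions score well.
        candidates = translate(x, z)
        y = max(
            candidates,
            key=lambda cand: score_forward(x, cand, z) + score_backward(x, cand, z),
        )
    return y
```

The point the sketch tries to capture is that candidates proposed by the forward direction are rescored with the backward translation model and the source-side language model that share the same latent variable.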
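The experiment-setup row quotes three concrete training details: the Vaswani et al. (2017) learning-rate schedule with 4k warmup steps, a constant word-dropout rate of 0.3, and KL-weight annealing tuned per task. A minimal PyTorch sketch of those pieces is below; `d_model=512` and the linear `anneal_steps=20000` schedule are illustrative assumptions, since the paper only says its best KL-annealing settings are listed in Table 2.

```python
# A minimal sketch (not the authors' released code) of the training-schedule
# details quoted above: the Vaswani et al. (2017) learning-rate schedule with
# 4k warmup steps, word dropout at rate 0.3, and a KL-weight annealing schedule.
# d_model and anneal_steps are illustrative assumptions, not values from the paper.

import torch


def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Inverse-square-root schedule from 'Attention Is All You Need'."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def word_dropout(tokens: torch.Tensor, unk_id: int, rate: float = 0.3) -> torch.Tensor:
    """Randomly replace input tokens with <unk>, a common trick to keep the
    decoder relying on the latent variable rather than pure teacher forcing."""
    mask = torch.rand_like(tokens, dtype=torch.float) < rate
    return tokens.masked_fill(mask, unk_id)


def kl_weight(step: int, anneal_steps: int = 20000) -> float:
    """Linearly anneal the KL term's weight from 0 to 1 (illustrative schedule)."""
    return min(1.0, step / anneal_steps)
```

In a training loop these would typically be applied per step: scaling the Adam learning rate by `transformer_lr(step)`, corrupting decoder inputs with `word_dropout` before teacher forcing, and multiplying the KL term of the objective by `kl_weight(step)`.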