Mirror-Generative Neural Machine Translation

Authors: Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, Jiajun Chen

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that the proposed MGNMT consistently outperforms existing approaches in a variety of language pairs and scenarios, including resource-rich and low-resource situations."
Researcher Affiliation | Collaboration | "1 National Key Laboratory for Novel Software Technology, Nanjing University zhengzx@smail.nju.edu.cn, {huangsj,daixinyu,chenjj}@nju.edu.cn; 2 ByteDance AI Lab {zhouhao.nlp,lileilab}@bytedance.com"
Pseudocode | Yes | "Algorithm 1: Training MGNMT from Non-Parallel Data; Algorithm 2: MGNMT Decoding with EM Algorithm." (A hedged sketch of the EM-style decoding loop is given after this table.)
Open Source Code | No | The paper contains no explicit statement or link providing access to the source code for the proposed MGNMT method.
Open Datasets | Yes | "To evaluate our model in resource-poor scenarios, we conducted experiments on WMT16 English-to/from-Romanian (WMT16 EN↔RO) translation task... As for resource-rich scenarios, we conducted experiments on WMT14 English-to/from-German (WMT14 EN↔DE), NIST English-to/from-Chinese (NIST EN↔ZH) translation tasks. For all the languages, we use the non-parallel data from News Crawl, except for NIST EN↔ZH, where the Chinese monolingual data were extracted from LDC corpus."
Dataset Splits | Yes | Dev/Test sets per task (Table 1 caption): newstest2013/newstest2014 (WMT14 EN↔DE), MT06/MT03 (NIST EN↔ZH), newstest2015/newstest2016 (WMT16 EN↔RO), and tst13/14 & newstest2014 (cross-domain EN↔DE). In addition, Table 2 lists the best KL-annealing setting for each task, chosen on the development sets.
Hardware Specification | Yes | "We trained our models on a single GTX 1080Ti GPU."
Software Dependencies | No | "We implemented our models on top of Transformer (Vaswani et al., 2017), as well as RNMT (Bahdanau et al., 2015) and GNMT (Shah & Barber, 2018), in PyTorch." (The accompanying footnote mentions PyTorch, but without a version.)
Experiment Setup | Yes | "For all language pairs, sentences were encoded using byte pair encoding (Sennrich et al., 2016a, BPE) with 32k merge operations... We used the Adam optimizer (Kingma & Ba, 2014) with the same learning rate schedule strategy as Vaswani et al. (2017), with 4k warmup steps. Each mini-batch consists of about 4,096 source and target tokens respectively... For all experiments, word dropout rates were set to a constant of 0.3. Honestly, annealing KL weight is somewhat tricky. Table 2 lists our best setting of KL-annealing for each task on the development sets." (A hedged sketch of this optimizer and schedule setup also follows the table.)
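
The pseudocode row above refers to Algorithm 2, MGNMT Decoding with EM. Below is a minimal Python sketch of how such an EM-style decoding loop can look; it is not the authors' implementation, and the model interface (sample_prior, infer_latent, beam_search, reconstruction_score) is a hypothetical stand-in for MGNMT's shared latent, its approximate posterior, and its paired translation/language models.

```python
# Hypothetical sketch of an EM-style decoding loop in the spirit of
# MGNMT's Algorithm 2. All methods on `model` are assumed interfaces:
#   model.sample_prior()                -> latent z drawn from the prior
#   model.infer_latent(x, y)            -> z ~ q(z | x, y)          (E-step)
#   model.beam_search(x, z, beam_size)  -> candidates ranked by
#                                          log p(y | x, z) + log p(y | z)
#   model.reconstruction_score(x, y, z) -> log p(x | y, z) + log p(x | z)

def mirror_decode(model, x, num_iters=3, beam_size=5):
    """Iteratively refine the translation of source sentence x."""
    # Initial draft decoded with a latent sampled from the prior (assumption).
    z = model.sample_prior()
    y = model.beam_search(x, z, beam_size)[0]

    for _ in range(num_iters):
        # E-step: re-estimate the shared latent from the current pair (x, y).
        z = model.infer_latent(x, y)

        # M-step: re-decode so that translation and language model scores
        # jointly select candidates, then rerank with the reverse
        # (reconstruction) direction, which the mirror structure provides.
        candidates = model.beam_search(x, z, beam_size)
        y = max(candidates,
                key=lambda cand: model.reconstruction_score(x, cand, z))
    return y
```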
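The experiment-setup row quotes the Adam optimizer with the learning-rate schedule of Vaswani et al. (2017), 4k warmup steps, constant word dropout of 0.3, and per-task KL annealing. The short PyTorch sketch below makes that setup concrete; the model dimension, the Adam betas/eps, and the KL-annealing length are illustrative assumptions not stated in the quoted passage.

```python
import torch

D_MODEL = 512          # assumed model dimension (not stated in the quote)
WARMUP_STEPS = 4000    # 4k warmup steps, as reported
WORD_DROPOUT = 0.3     # constant word-dropout rate, as reported

def noam_lr(step: int) -> float:
    """Transformer schedule: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return (D_MODEL ** -0.5) * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

def kl_weight(step: int, anneal_steps: int = 20_000) -> float:
    """Linear KL annealing; the length is tuned per task (Table 2), so
    `anneal_steps` here is only a placeholder."""
    return min(1.0, step / anneal_steps)

# Placeholder module standing in for the MGNMT parameters.
model = torch.nn.Linear(D_MODEL, D_MODEL)

# Base lr of 1.0 so LambdaLR yields exactly noam_lr(step); the betas/eps are
# the usual Transformer values and are an assumption here.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
```

In a training loop one would call optimizer.step() followed by scheduler.step() after each update and scale the KL term of the ELBO by kl_weight(step).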