Towards Making the Most of Context in Neural Machine Translation

Authors: Zaixiang Zheng, Xiang Yue, Shujian Huang, Jiajun Chen, Alexandra Birch

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU on state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated. [...] We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE).
Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University; ILCC, School of Informatics, University of Edinburgh
Pseudocode | No | No explicit pseudocode or algorithm blocks were found.
Open Source Code | No | No statement or link for open-source code release was found in the paper.
Open Datasets | Yes | We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE). The Chinese-English and English-German TED datasets are from the IWSLT 2015 and 2017 evaluation campaigns respectively. [...] News (EN-DE). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used for development and test sets respectively. Europarl (EN-DE). The corpus is extracted from Europarl v7 according to the method mentioned in Maruf et al. [2019].
Dataset Splits | Yes | We mainly explore and develop our approach on TED ZH-EN, where we take dev2010 as the development set and tst2010-2013 as the test set. For TED EN-DE, we use tst2016-2017 as our test set and the rest as the development set. [...] The WMT newstest2015 and newstest2016 are used for development and test sets respectively. (These splits are also summarized in a sketch below the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments were mentioned.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned.
Experiment Setup | Yes | For models on TED ZH-EN, we used a configuration smaller than transformer base [Vaswani et al., 2017] with model dimension d_z = 256, feed-forward dimension d_ffn = 512 and number of layers N = 4. As for models on the rest of the datasets, we change the dimensions to 512/2048. We used the Adam optimizer [Kingma and Ba, 2014] and the same learning rate schedule strategy as [Vaswani et al., 2017] with 8,000 warmup steps. The training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing [Szegedy et al., 2016] of value 0.1 was used for training. For inference, we used beam search with a width of 5 and a length penalty of 0.6.
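As a reading aid for the Experiment Setup row, the following is a minimal sketch of the quoted hyperparameters and the inverse-square-root learning-rate schedule of Vaswani et al. [2017]. All variable and key names are illustrative assumptions; this is not the authors' implementation.

```python
# Minimal sketch of the hyperparameters quoted in the Experiment Setup row.
# Names are hypothetical; only values stated in the paper are filled in.
TED_ZH_EN = dict(d_model=256, d_ffn=512, num_layers=4)
# For the remaining datasets only the dimensions are said to change (512/2048).
OTHER_DATASETS = dict(d_model=512, d_ffn=2048)

TRAINING = dict(
    optimizer="Adam",        # Kingma and Ba [2014]
    warmup_steps=8000,
    batch_tokens=2048,       # ~2048 source and ~2048 target tokens per batch
    label_smoothing=0.1,
)
INFERENCE = dict(beam_size=5, length_penalty=0.6)


def transformer_lr(step: int, d_model: int = 256, warmup: int = 8000) -> float:
    """Inverse-square-root schedule from Vaswani et al. [2017]:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```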
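The dev/test splits quoted in the Dataset Splits row can likewise be condensed into a small lookup structure. The sketch below only restates what the table reports; the Europarl entry records the extraction method because no explicit dev/test assignment is quoted.

```python
# Illustrative summary of the reported dev/test splits; names are hypothetical.
DATASET_SPLITS = {
    "TED ZH-EN": {"dev": "dev2010", "test": "tst2010-2013"},
    "TED EN-DE": {"dev": "remaining IWSLT sets", "test": "tst2016-2017"},
    "News EN-DE": {"dev": "newstest2015", "test": "newstest2016"},
    "Europarl EN-DE": {"note": "extracted from Europarl v7 following Maruf et al. [2019]"},
}

if __name__ == "__main__":
    for corpus, split in DATASET_SPLITS.items():
        print(corpus, split)
```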