Towards Making the Most of Context in Neural Machine Translation
Authors: Zaixiang Zheng, Xiang Yue, Shujian Huang, Jiajun Chen, Alexandra Birch
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU on state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated. [...] We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE). |
| Researcher Affiliation | Academia | ¹National Key Laboratory for Novel Software Technology, Nanjing University; ²ILCC, School of Informatics, University of Edinburgh |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | No | No statement or link for open-source code release was found in the paper. |
| Open Datasets | Yes | We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE). The Chinese-English and English-German TED datasets are from the IWSLT 2015 and 2017 evaluation campaigns, respectively. [...] News (EN-DE). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used as the development and test sets, respectively. Europarl (EN-DE). The corpus is extracted from Europarl v7 according to the method described in Maruf et al. [2019]. |
| Dataset Splits | Yes | We mainly explore and develop our approach on TED ZH-EN, where we take dev2010 as the development set and tst2010-2013 as the test set. For TED EN-DE, we use tst2016-2017 as our test set and the rest as the development set. [...] The WMT newstest2015 and newstest2016 are used as the development and test sets, respectively. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments were mentioned. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned. |
| Experiment Setup | Yes | For models on TED ZH-EN, we used a configuration smaller than Transformer base [Vaswani et al., 2017], with model dimension d_z = 256, feed-forward dimension d_ffn = 512, and number of layers N = 4. For models on the remaining datasets, we change these dimensions to 512/2048. We used the Adam optimizer [Kingma and Ba, 2014] and the same learning rate schedule strategy as [Vaswani et al., 2017] with 8,000 warmup steps. Each training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing [Szegedy et al., 2016] with value 0.1 was used for training. For inference, we used beam search with a width of 5 and a length penalty of 0.6. |
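
The Experiment Setup row reports the schedule only by reference, and the paper releases no code, so the snippet below is a minimal sketch, assuming the standard inverse-square-root ("Noam") learning-rate schedule of Vaswani et al. [2017] with the quoted 8,000 warmup steps and model dimension 256 (TED ZH-EN) or 512 (the other datasets). The function name `noam_lr` and the standalone script form are illustrative, not taken from the paper.

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 8000) -> float:
    """Learning-rate schedule from Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).

    d_model = 256 matches the TED ZH-EN configuration quoted above;
    use 512 for the other datasets. warmup_steps = 8,000 as reported.
    """
    step = max(step, 1)  # guard against division by zero on the first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # The rate rises linearly over the 8,000 warmup steps, then decays as step^-0.5.
    for s in (1, 4000, 8000, 16000, 100000):
        print(f"step {s:>6}: lr = {noam_lr(s):.6f}")
```

Under this schedule the learning rate grows linearly during warmup and then decays with the inverse square root of the update count, which is why the warmup count and model dimension are the only schedule hyperparameters the paper needs to report.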