Towards Making the Most of Context in Neural Machine Translation

Authors: Zaixiang Zheng, Xiang Yue, Shujian Huang, Jiajun Chen, Alexandra Birch

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU on state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated. [...] We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE).
Researcher Affiliation | Academia | National Key Laboratory for Novel Software Technology, Nanjing University; ILCC, School of Informatics, University of Edinburgh
Pseudocode | No | No explicit pseudocode or algorithm blocks were found.
Open Source Code | No | No statement or link for open-source code release was found in the paper.
Open Datasets | Yes | We experiment on four widely used document-level parallel datasets in two language pairs for machine translation: TED (ZH-EN/EN-DE). The Chinese-English and English-German TED datasets are from the IWSLT 2015 and 2017 evaluation campaigns respectively. [...] News (EN-DE). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used for development and test sets respectively. Europarl (EN-DE). The corpus is extracted from Europarl v7 according to the method mentioned in Maruf et al. [2019].
Dataset Splits | Yes | We mainly explore and develop our approach on TED ZH-EN, where we take dev2010 as the development set and tst2010-2013 as the test set. For TED EN-DE, we use tst2016-2017 as our test set and the rest as the development set. [...] The WMT newstest2015 and newstest2016 are used for development and test sets respectively. (These splits are also summarized in a sketch below the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for experiments were mentioned.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned.
Experiment Setup | Yes | For models on TED ZH-EN, we used a configuration smaller than transformer base [Vaswani et al., 2017] with model dimension d_z = 256, feed-forward dimension d_ffn = 512 and number of layers N = 4. As for models on the rest of the datasets, we change the dimensions to 512/2048. We used the Adam optimizer [Kingma and Ba, 2014] and the same learning rate schedule strategy as [Vaswani et al., 2017] with 8,000 warmup steps. The training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing [Szegedy et al., 2016] of value 0.1 was used for training. For inference, we used beam search with a width of 5 and a length penalty of 0.6.
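As a reading aid for the Experiment Setup row, the following is a minimal sketch of the quoted hyperparameters and the inverse-square-root learning-rate schedule of Vaswani et al. [2017]. All variable and key names are illustrative assumptions; this is not the authors' implementation.

```python
# Minimal sketch of the hyperparameters quoted in the Experiment Setup row.
# Names are hypothetical; only values stated in the paper are filled in.
TED_ZH_EN = dict(d_model=256, d_ffn=512, num_layers=4)
# For the remaining datasets only the dimensions are said to change (512/2048).
OTHER_DATASETS = dict(d_model=512, d_ffn=2048)

TRAINING = dict(
    optimizer="Adam",        # Kingma and Ba [2014]
    warmup_steps=8000,
    batch_tokens=2048,       # ~2048 source and ~2048 target tokens per batch
    label_smoothing=0.1,
)
INFERENCE = dict(beam_size=5, length_penalty=0.6)


def transformer_lr(step: int, d_model: int = 256, warmup: int = 8000) -> float:
    """Inverse-square-root schedule from Vaswani et al. [2017]:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```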
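The dev/test splits quoted in the Dataset Splits row can likewise be condensed into a small lookup structure. The sketch below only restates what the table reports; the Europarl entry records the extraction method because no explicit dev/test assignment is quoted.

```python
# Illustrative summary of the reported dev/test splits; names are hypothetical.
DATASET_SPLITS = {
    "TED ZH-EN": {"dev": "dev2010", "test": "tst2010-2013"},
    "TED EN-DE": {"dev": "remaining IWSLT sets", "test": "tst2016-2017"},
    "News EN-DE": {"dev": "newstest2015", "test": "newstest2016"},
    "Europarl EN-DE": {"note": "extracted from Europarl v7 following Maruf et al. [2019]"},
}

if __name__ == "__main__":
    for corpus, split in DATASET_SPLITS.items():
        print(corpus, split)
```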