Modeling Coherence for Discourse Neural Machine Translation

Authors: Hao Xiong, Zhongjun He, Hua Wu, Haifeng Wang

AAAI 2019, pp. 7338-7345

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Practical results on multiple discourse test datasets indicate that our model significantly improves the translation quality over the state-of-the-art baseline system by +1.23 BLEU score. Moreover, our model generates more discourse coherent text and obtains +2.2 BLEU improvements when evaluated by discourse metrics.
Researcher Affiliation | Industry | Hao Xiong, Zhongjun He, Hua Wu, Haifeng Wang. Baidu Inc., No. 10, Shangdi 10th Street, Beijing, 100085, China. {xionghao05, hezhongjun, wu_hua, wanghaifeng}@baidu.com
Pseudocode | No | The paper describes its model architecture and training procedures but does not include any pseudocode or explicitly labeled algorithm blocks.
Open Source Code | No | The paper refers to third-party open-source toolkits like "t2t" and "Moses Toolkit" (Footnote 2), but does not state that the authors are releasing their own code for the proposed method.
Open Datasets | Yes | We evaluate the performance of our model on the IWSLT speech translation task with TED talks (Cettolo, Girardi, and Federico 2012) as training corpus, which includes multiple entire talks.
Dataset Splits | Yes | Specifically, we take the dev-2010 as our development set, and tst-2013~2015 as our test sets. Statistically, we have 14,258 talks and 231,266 sentences in the training data, 48 talks and 879 sentences in the development set, and 234 talks and 3,874 sentences in the test sets.
Hardware Specification | Yes | The training speed of two-pass-bleu-rl model is 8 talks per one second running on V100 with 8 GPUs, and it needs about 1.5 days to converge.
Software Dependencies | Yes | t2t: This is the official supplied open source toolkit for running Transformer model. Specifically, we use the v1.6.5 release. (A version-check sketch for this pinned release follows the table.)
Experiment Setup | Yes | For all systems, we use the Adam Optimizer (Kingma and Ba 2015) with the identical settings to t2t, to tune the parameters. One thing deserves to be noted is the value of hyperparameter batch size. In general, a large value of batch size achieves better performance when training on large scale corpus (more than millions) (Vaswani et al. 2017). Thus we set the batch size to 320 for t2t system... we set both the embedding and recurrent hidden size to 100, and apply one dropout layer with keeping probability equals to 0.3 between the embedding layer and the bidirectional recurrent layers. As shown in Figure 2, we see that setting the value of λ1 to 0.85 and λ2 to 0.80 produces the best performance for first-pass-rl and two-pass-rl.
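The Experiment Setup row fixes the coherence model's sizes (embedding and recurrent hidden size of 100, a dropout layer with keep probability 0.3 between the embedding layer and the bidirectional recurrent layers) and the best reward weights (λ1 = 0.85 for first-pass-rl, λ2 = 0.80 for two-pass-rl). The sketch below is an illustrative reconstruction rather than the authors' code: the choice of PyTorch, the class name CoherenceEncoder, and the LSTM cell are assumptions; only the layer sizes, the keep probability, and the λ values come from the quoted text.

```python
# Hedged reconstruction of the coherence encoder sizes quoted above.
# PyTorch, the class name, and the LSTM cell are assumptions; the paper
# only specifies the sizes, the keep probability, and the lambda values.
import torch
import torch.nn as nn


class CoherenceEncoder(nn.Module):
    """Bidirectional recurrent encoder with the reported sizes."""

    def __init__(self, vocab_size: int, emb_size: int = 100, hidden_size: int = 100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # The paper reports a *keep* probability of 0.3 between the embedding
        # and the bidirectional recurrent layers; nn.Dropout expects the *drop*
        # probability, hence 1.0 - 0.3.
        self.dropout = nn.Dropout(p=1.0 - 0.3)
        self.birnn = nn.LSTM(emb_size, hidden_size,
                             bidirectional=True, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.dropout(self.embedding(token_ids))
        outputs, _ = self.birnn(x)
        return outputs  # shape: [batch, seq_len, 2 * hidden_size]


# Best reward weights reported on the development set (Figure 2 of the paper).
# How each lambda combines the translation and coherence rewards is not
# reproduced here; 0.85 and 0.80 are simply the reported optima.
LAMBDA_FIRST_PASS_RL = 0.85
LAMBDA_TWO_PASS_RL = 0.80
```

Once a vocabulary size is fixed, CoherenceEncoder(vocab_size)(token_ids) accepts a [batch, seq_len] tensor of token ids and returns the concatenated forward and backward hidden states.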
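The Software Dependencies row pins the toolkit to the t2t v1.6.5 release. The snippet below is a minimal, generic sketch rather than anything from the paper: it assumes Python 3.8+ and that the toolkit is installed from PyPI under the name tensor2tensor, and simply confirms that the pinned release is present before a reproduction run.

```python
# Minimal sketch (assumption: t2t installed from PyPI as "tensor2tensor",
# Python 3.8+ for importlib.metadata). Checks the pinned v1.6.5 release.
from importlib.metadata import PackageNotFoundError, version

EXPECTED = "1.6.5"  # release reported in the paper

try:
    installed = version("tensor2tensor")
except PackageNotFoundError:
    raise SystemExit("tensor2tensor is not installed; install the v1.6.5 release to match the paper.")

if installed == EXPECTED:
    print(f"tensor2tensor {installed} matches the reported release.")
else:
    print(f"Warning: found tensor2tensor {installed}; the paper reports v{EXPECTED}.")
```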