Modeling Coherence for Discourse Neural Machine Translation
Authors: Hao Xiong, Zhongjun He, Hua Wu, Haifeng Wang
AAAI 2019, pp. 7338-7345 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Practical results on multiple discourse test datasets indicate that our model significantly improves the translation quality over the state-of-the-art baseline system by +1.23 BLEU score. Moreover, our model generates more discourse coherent text and obtains +2.2 BLEU improvements when evaluated by discourse metrics. |
| Researcher Affiliation | Industry | Hao Xiong, Zhongjun He, Hua Wu, Haifeng Wang, Baidu Inc., No. 10, Shangdi 10th Street, Beijing, 100085, China. {xionghao05, hezhongjun, wu_hua, wanghaifeng}@baidu.com |
| Pseudocode | No | The paper describes its model architecture and training procedures but does not include any pseudocode or explicitly labeled algorithm blocks. |
| Open Source Code | No | The paper refers to third-party open-source toolkits like "t2t" and "Moses Toolkit" (Footnote 2), but does not state that the authors are releasing their own code for the proposed method. |
| Open Datasets | Yes | We evaluate the performance of our model on the IWSLT speech translation task with TED talks (Cettolo, Girardi, and Federico 2012) as training corpus, which includes multiple entire talks. |
| Dataset Splits | Yes | Specifically, we take dev-2010 as our development set, and tst-2013–2015 as our test sets. Statistically, we have 14,258 talks and 231,266 sentences in the training data, 48 talks and 879 sentences in the development set, and 234 talks and 3,874 sentences in the test sets. (The split counts are summarized in a sketch after this table.) |
| Hardware Specification | Yes | The training speed of the two-pass-bleu-rl model is 8 talks per second running on V100 hardware with 8 GPUs, and it needs about 1.5 days to converge. |
| Software Dependencies | Yes | t2t: This is the officially supplied open-source toolkit for running the Transformer model. Specifically, we use the v1.6.5 release. |
| Experiment Setup | Yes | For all systems, we use the Adam optimizer (Kingma and Ba 2015) with settings identical to t2t to tune the parameters. One thing worth noting is the value of the batch-size hyperparameter. In general, a large batch size achieves better performance when training on a large-scale corpus (more than millions of sentences) (Vaswani et al. 2017). Thus we set the batch size to 320 for the t2t system... We set both the embedding and recurrent hidden size to 100, and apply one dropout layer with keep probability equal to 0.3 between the embedding layer and the bidirectional recurrent layers. As shown in Figure 2, setting λ1 to 0.85 and λ2 to 0.80 produces the best performance for first-pass-rl and two-pass-rl. (These values are collected into a configuration sketch after this table.) |
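The dataset split counts quoted in the Dataset Splits row can be restated compactly. The numbers are taken directly from the quoted passage; the dictionary name `IWSLT_TED_SPLITS` and its layout are illustrative only and do not come from the paper or any released code.

```python
# Split statistics for the IWSLT TED-talk corpus as reported in the paper.
# Counts are quoted from the paper; the dictionary itself is only an
# illustrative summary, not part of any released code.
IWSLT_TED_SPLITS = {
    "train": {"talks": 14_258, "sentences": 231_266},
    "dev (dev-2010)": {"talks": 48, "sentences": 879},
    "test (tst-2013 to tst-2015)": {"talks": 234, "sentences": 3_874},
}

if __name__ == "__main__":
    for split, stats in IWSLT_TED_SPLITS.items():
        print(f"{split:<28} {stats['talks']:>6} talks  {stats['sentences']:>8} sentences")
```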
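Similarly, the values quoted in the Experiment Setup row can be gathered into a minimal configuration sketch. The values (Adam optimizer with t2t defaults, batch size 320, embedding and recurrent hidden size 100, dropout keep probability 0.3, λ1 = 0.85, λ2 = 0.80) are those reported in the paper; the `TrainingConfig` dataclass and its field names are hypothetical conveniences, not the authors' t2t code, which presumably sets these through t2t's own hyperparameter mechanism.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hyperparameters reported in the paper's experiment setup.

    The dataclass and its field names are hypothetical; only the
    values are taken from the quoted description.
    """
    optimizer: str = "adam"         # Adam (Kingma and Ba 2015) with t2t's default settings
    batch_size: int = 320           # batch size used for the t2t system
    embedding_size: int = 100       # embedding size reported in the setup
    hidden_size: int = 100          # bidirectional recurrent hidden size
    dropout_keep_prob: float = 0.3  # keep probability between embedding and BiRNN layers
    lambda_1: float = 0.85          # best-performing value for first-pass-rl (Figure 2)
    lambda_2: float = 0.80          # best-performing value for two-pass-rl (Figure 2)


if __name__ == "__main__":
    print(TrainingConfig())
```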