Multi-Channel Encoder for Neural Machine Translation

Authors: Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical study on Chinese-English translation shows that our model can improve by 6.52 BLEU points upon a strong open source NMT system: DL4MT. On the WMT14 English-French task, our single shallow system achieves BLEU=38.8, comparable with the state-of-the-art deep models.
Researcher Affiliation | Industry | Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu; Baidu Inc., No. 10, Shangdi 10th Street, Beijing, 100085, China; {xionghao05, hezhongjun, huxiaoguang, wu_hua}@baidu.com
Pseudocode | No | The paper describes methods using equations and diagrams (Figure 2) but does not contain a formal pseudocode block or algorithm.
Open Source Code | No | Footnotes 1, 5, and 6 on page 3 list URLs for DL4MT, T2T, and ConvS2S, which are open-source toolkits used for comparison, not the authors' own implementation code for MCE. There is no explicit statement or link for the MCE implementation.
Open Datasets | Yes | We use a subset of the data available for NIST Open MT08 task and WMT14 parallel corpus as our training data. The detailed data sets are Europarl v7, Common Crawl, UN, News Commentary, Gigaword.
Dataset Splits | Yes | For the Chinese-English task, we choose the NIST 2006 (NIST06) dataset as our development set, and the NIST 2003 (NIST03), 2004 (NIST04), 2005 (NIST05), 2008 (NIST08), and 2012 (NIST12) datasets as our test sets. For the English-French task, the news-test-2012 and news-test-2013 sets are concatenated as our development set, and news-test-2014 is the test set. (A sketch of these splits appears after this table.)
Hardware Specification | Yes | As we set the batch size to 128, on the Chinese-English task it takes around 1 day to train the basic model on 8 NVIDIA P40 GPUs, and on the English-French task it takes around 7 days.
Software Dependencies | No | For the Chinese-English task, we run the widely used open source toolkit DL4MT together with two recently published strong open source toolkits, T2T and ConvS2S, on the same experimental settings to validate the performance of our models. Beyond that, we also reimplement an attention-based NMT written in TensorFlow as our baseline system. Toolkit and framework names are given, but the quoted setup does not specify version numbers.
Experiment Setup | Yes | We use 512-dimensional word embeddings for both the source and target languages. All hidden layers, both in the encoder and the decoder, have 512 memory cells. The output layer size is the same as the hidden size. The dimension of c_j is 1024. ... we apply gradient clipping: ... 1.0 in our case. ... we use the Adam optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9. ... we set the batch size to 128 ... And we use a beam width of 10 in all the experiments. ... we set the dropout rate to 0.5. (These hyperparameters are collected into a configuration sketch after this table.)
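
The split layout quoted in the Dataset Splits row can be summarized in a few lines of Python. This is only an illustrative sketch; the keys and set names below are labels chosen here, not file names or paths from the authors' pipeline.

    # Illustrative summary of the dev/test splits reported for the two tasks.
    # Set names are labels for this sketch, not actual corpus file names.
    SPLITS = {
        "zh-en": {
            "dev": ["NIST06"],
            "test": ["NIST03", "NIST04", "NIST05", "NIST08", "NIST12"],
        },
        "en-fr": {
            # news-test-2012 and news-test-2013 are concatenated into one dev set
            "dev": ["news-test-2012", "news-test-2013"],
            "test": ["news-test-2014"],
        },
    }

    if __name__ == "__main__":
        for pair, sets in SPLITS.items():
            print(f"{pair}: dev={sets['dev']}, test={sets['test']}")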
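
The Experiment Setup row amounts to a training configuration. The snippet below is a minimal sketch that collects the reported values and builds an Adam optimizer with the TensorFlow/Keras API; it is not the authors' code (the paper only says their baseline is written in TensorFlow), and the learning rate is not given in the quoted setup, so the value used here is a placeholder assumption.

    # Hypothetical configuration sketch; not the authors' implementation.
    import tensorflow as tf

    HPARAMS = {
        "embedding_dim": 512,   # source and target word embeddings
        "hidden_size": 512,     # encoder/decoder hidden layers and output layer
        "context_dim": 1024,    # dimension of the context vector c_j
        "batch_size": 128,
        "beam_width": 10,       # beam search width used at decoding time
        "dropout_rate": 0.5,
        "grad_clip": 1.0,       # gradient clipping threshold
    }

    # The paper reports Adam with beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9.
    # The learning rate is not stated in the quoted setup; 1e-4 is a placeholder.
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=1e-4,
        beta_1=0.9,
        beta_2=0.98,
        epsilon=1e-9,
        clipnorm=HPARAMS["grad_clip"],
    )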