Multi-Channel Encoder for Neural Machine Translation
Authors: Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical study on Chinese-English translation shows that our model can improve by 6.52 BLEU points upon a strong open source NMT system: DL4MT. On the WMT14 English-French task, our single shallow system achieves BLEU=38.8, comparable with the state-of-the-art deep models. |
| Researcher Affiliation | Industry | Hao Xiong, Zhongjun He, Xiaoguang Hu, Hua Wu; Baidu Inc., No. 10, Shangdi 10th Street, Beijing, 100085, China; {xionghao05, hezhongjun, huxiaoguang, wu_hua}@baidu.com |
| Pseudocode | No | The paper describes methods using equations and diagrams (Figure 2) but does not contain a formal pseudocode block or algorithm. |
| Open Source Code | No | On page 3, footnotes 1, 5, and 6 list URLs for DL4MT, T2T, and ConvS2S, which are open-source toolkits used for comparison, not the authors' own implementation code for MCE. There is no explicit statement or link for the MCE implementation. |
| Open Datasets | Yes | We use a subset of the data available for the NIST Open MT08 task and the WMT14 parallel corpus as our training data. The datasets used are Europarl v7, Common Crawl, UN, News Commentary, and Gigaword. |
| Dataset Splits | Yes | For the Chinese-English task, we choose the NIST 2006 (NIST06) dataset as our development set, and the NIST 2003 (NIST03), 2004 (NIST04), 2005 (NIST05), 2008 (NIST08), and 2012 (NIST12) datasets as our test sets. For the English-French task, news-test-2012 and news-test-2013 are concatenated as our development set, and news-test-2014 is the test set. |
| Hardware Specification | Yes | As we set the batch size to 128, training the basic model takes around 1 day on 8 NVIDIA P40 GPUs for the Chinese-English task and around 7 days for the English-French task. |
| Software Dependencies | No | For the Chinese-English task, we run the widely used open-source toolkit DL4MT together with two recently published strong open-source toolkits, T2T and ConvS2S, under the same experimental settings to validate the performance of our models. Beyond that, we also reimplement an attention-based NMT system written in TensorFlow as our baseline. |
| Experiment Setup | Yes | we use 512 dimensional word embeddings for both the source and target languages. All hidden layers, both in the encoder and the decoder, have 512 memory cells. The output layer size is the same as the hidden size. The dimension of c_j is 1024. ... we apply gradient clipping: ... 1.0 in our case. ... we use the Adam optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10⁻⁹. ... we set the batch size to 128... And we use a beam width of 10 in all the experiments. ... we set the dropout rate to 0.5. |
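
For readers who want to reproduce the reported setup, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below is illustrative only: the identifiers are hypothetical (the MCE implementation is not released), the learning rate is not stated in the quoted excerpt and is therefore left at the library default, and mapping "gradient clipping ... 1.0" to a per-update `clipnorm` is an assumption, since the excerpt does not say whether clipping is by value or by norm.

```python
# Illustrative sketch of the quoted experiment setup (not the authors' code).
# All identifiers are hypothetical; only the numeric values come from the paper excerpt.

MCE_CONFIG = {
    "embedding_dim": 512,    # source and target word embeddings
    "hidden_size": 512,      # all encoder/decoder hidden layers (memory cells)
    "output_size": 512,      # "output layer size is the same as the hidden size"
    "context_dim": 1024,     # dimension of the attention context c_j
    "grad_clip": 1.0,        # gradient clipping threshold
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_epsilon": 1e-9,
    "batch_size": 128,
    "beam_width": 10,        # beam search width at decoding time
    "dropout_rate": 0.5,
}


def build_optimizer(config):
    """Build an Adam optimizer with the quoted settings.

    Uses tf.keras.optimizers.Adam, which accepts the beta/epsilon values
    and per-update gradient clipping via `clipnorm`. The learning rate is
    not given in the excerpt, so the Keras default is kept here.
    """
    import tensorflow as tf  # assumes TensorFlow 2.x is installed

    return tf.keras.optimizers.Adam(
        beta_1=config["adam_beta1"],
        beta_2=config["adam_beta2"],
        epsilon=config["adam_epsilon"],
        clipnorm=config["grad_clip"],
    )
```

A configuration like this covers every value the paper excerpt reports, but it is a reconstruction, not a substitute for the authors' unreleased MCE code.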