Neural Machine Translation with Key-Value Memory-Augmented Attention
Authors: Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on Chinese⇒English and WMT17 German⇒English translation tasks demonstrate the superiority of the proposed model. |
| Researcher Affiliation | Industry | Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang Tencent AI Lab {fandongmeng,zptu,yongcheng,gavinwu,jasonzhai,yuekuiyang,diwang}@tencent.com |
| Pseudocode | No | The paper describes the model components and operations using text and equations, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | For Zh⇒En, the training data consist of 1.25M sentence pairs extracted from LDC corpora. For De⇒En, we perform our experiments on the corpus provided by WMT17, which contains 5.6M sentence pairs. |
| Dataset Splits | Yes | We choose the NIST 2002 (MT02) dataset as our validation set, and the NIST 2003-2006 (MT03-06) datasets as our test sets. For De⇒En, we perform our experiments on the corpus provided by WMT17, which contains 5.6M sentence pairs. We use newstest2016 as the development set, and newstest2017 as the test set. |
| Hardware Specification | Yes | When running on a single GPU device (Tesla P40), the speed of the RNNSEARCH model is 2773 target words per second, while the speed of the proposed models is 1676-2263 target words per second. |
| Software Dependencies | No | The paper mentions optimizers such as SGD and AdaDelta, and uses GRUs/RNNs, but does not provide specific version numbers for any software libraries or frameworks used. |
| Experiment Setup | Yes | The parameters are updated by SGD and mini-batch (size 80) with learning rate controlled by AdaDelta [Zeiler, 2012] (ϵ = 1e-6 and ρ = 0.95). ... The dimension of word embedding and hidden layer is 512, and the beam size in testing is 10. We apply dropout on the output layer to avoid over-fitting [Hinton et al., 2012], with dropout rate being 0.5. Hyper-parameter λ in Eq. 19 is set to 1.0. (A hedged configuration sketch follows the table.) |
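
The hyper-parameters quoted in the Experiment Setup row can be collected into a small, self-contained sketch. This is not the authors' released code: the class and function names (`TrainingConfig`, `build_optimizer`) and the PyTorch hookup are illustrative assumptions; only the numeric values (batch size 80, AdaDelta with ϵ = 1e-6 and ρ = 0.95, dimension 512, beam size 10, dropout 0.5, λ = 1.0) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # Values reported in the paper's experiment setup.
    batch_size: int = 80          # mini-batch size for SGD updates
    adadelta_rho: float = 0.95    # AdaDelta decay rate rho
    adadelta_eps: float = 1e-6    # AdaDelta numerical constant epsilon
    emb_dim: int = 512            # word-embedding dimension
    hidden_dim: int = 512         # hidden-layer dimension
    beam_size: int = 10           # beam width used at test time
    dropout: float = 0.5          # dropout rate on the output layer
    lambda_eq19: float = 1.0      # hyper-parameter lambda in Eq. 19 of the paper


def build_optimizer(model_params, cfg: TrainingConfig):
    """Hypothetical optimizer hookup, assuming a PyTorch implementation.

    The paper does not name a framework; this simply maps the reported
    AdaDelta settings onto torch.optim.Adadelta.
    """
    import torch
    return torch.optim.Adadelta(
        model_params,
        rho=cfg.adadelta_rho,
        eps=cfg.adadelta_eps,
    )
```

A reproduction would additionally need the model itself (the key-value memory-augmented attention encoder-decoder), which the paper describes only in equations; the sketch above covers just the training hyper-parameters that the report was able to extract.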