Neural Machine Translation with Key-Value Memory-Augmented Attention

Authors: Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang

IJCAI 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results on Chinese-English and WMT17 German-English translation tasks demonstrate the superiority of the proposed model. |
| Researcher Affiliation | Industry | Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, Di Wang (Tencent AI Lab), {fandongmeng,zptu,yongcheng,gavinwu,jasonzhai,yuekuiyang,diwang}@tencent.com |
| Pseudocode | No | The paper describes the model components and operations in text and equations, but contains no structured pseudocode or algorithm blocks. (An illustrative sketch of key-value memory attention is given below the table.) |
| Open Source Code | No | The paper provides no statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | For Zh-En, the training data consist of 1.25M sentence pairs extracted from LDC corpora. For De-En, the experiments use the corpus provided by WMT17, which contains 5.6M sentence pairs. |
| Dataset Splits | Yes | For Zh-En, NIST 2002 (MT02) is the valid set and NIST 2003-2006 (MT03-06) are the test sets. For De-En, newstest2016 is the development set and newstest2017 is the test set. |
| Hardware Specification | Yes | When running on a single GPU (Tesla P40), the speed of the RNNSEARCH model is 2773 target words per second, while the speed of the proposed models is 1676-2263 target words per second. |
| Software Dependencies | No | The paper mentions optimizers (SGD, AdaDelta) and GRU-based RNNs, but provides no version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | The parameters are updated by SGD and mini-batch (size 80) with learning rate controlled by AdaDelta [Zeiler, 2012] (ε = 1e-6 and ρ = 0.95). ... The dimension of word embedding and hidden layer is 512, and the beam size in testing is 10. Dropout is applied on the output layer to avoid over-fitting [Hinton et al., 2012], with a dropout rate of 0.5. Hyper-parameter λ in Eq. 19 is set to 1.0. (An AdaDelta sketch with these settings is given below the table.) |
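
Since the paper provides no pseudocode, the following is a minimal NumPy sketch of one step of generic key-value memory-augmented attention, the mechanism named in the title: a fixed key-memory is used for addressing, while a separate value-memory is read from and then updated as decoding proceeds. The function names, shapes, and in particular the value-update rule are illustrative assumptions, not the authors' equation-level formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()            # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum()

def kv_attention_step(query, keys, values, Wq, Wk, v, Wu):
    """One read/update step over a key-value memory.

    query  : (d,)   current decoder state s_t
    keys   : (n, d) fixed key-memory used for addressing
                    (e.g. the encoder annotations)
    values : (n, d) value-memory that is read and then updated
    """
    # Additive (Bahdanau-style) scoring against the keys only.
    scores = np.tanh(query @ Wq + keys @ Wk) @ v     # (n,)
    alpha = softmax(scores)                          # attention weights
    context = alpha @ values                         # read from the values
    # Illustrative update: interpolate heavily attended value slots toward
    # a projection of the current state, so the memory can track what has
    # already been translated. The paper's actual update equations differ.
    values = values + alpha[:, None] * (np.tanh(query @ Wu) - values)
    return context, values, alpha

# Toy usage with assumed sizes: n source positions, d hidden units.
n, d = 7, 512
rng = np.random.default_rng(0)
Wq, Wk, Wu = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
v = rng.normal(scale=0.1, size=d)
keys = rng.normal(size=(n, d))
values = keys.copy()    # value-memory initialised from the keys
context, values, alpha = kv_attention_step(
    rng.normal(size=d), keys, values, Wq, Wk, v, Wu)
```

Keeping the keys fixed while only the values change is the core design point: addressing stays stable across decoding steps even as the memory's contents record translation progress.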
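
The Experiment Setup row reports AdaDelta with ε = 1e-6 and ρ = 0.95 but names no framework or library version. As a concrete reference for those two hyper-parameters, here is a self-contained NumPy sketch of the AdaDelta update rule from Zeiler [2012]; the class and variable names are assumptions for illustration.

```python
import numpy as np

class AdaDelta:
    """AdaDelta update rule (Zeiler, 2012) with the paper's reported settings."""

    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.acc_grad = np.zeros(shape)   # running average of squared gradients
        self.acc_step = np.zeros(shape)   # running average of squared updates

    def step(self, params, grad):
        self.acc_grad = self.rho * self.acc_grad + (1 - self.rho) * grad ** 2
        # Per-dimension step size: RMS of past updates over RMS of gradients.
        update = (-np.sqrt(self.acc_step + self.eps)
                  / np.sqrt(self.acc_grad + self.eps) * grad)
        self.acc_step = self.rho * self.acc_step + (1 - self.rho) * update ** 2
        return params + update

# Toy usage: minimise f(w) = ||w||^2, whose gradient is 2w.
w = np.ones(4)
opt = AdaDelta(w.shape)
for _ in range(100):
    w = opt.step(w, 2 * w)
```

The remaining reported settings (mini-batch size 80, embedding and hidden size 512, dropout 0.5, beam size 10, λ = 1.0) are model and search hyper-parameters, independent of this update rule.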