Attention-via-Attention Neural Machine Translation
Authors: Shenjian Zhao, Zhihua Zhang
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness and the efficiency of the proposed attention-via-attention model on the WMT 15 En-Fr and En-Cs translation tasks. We conduct comparison with various strong baselines including RNNsearch (Bahdanau, Cho, and Bengio 2015), GNMT (Wu et al. 2016), bpe2char models (Chung, Cho, and Bengio 2016), char2char models (Lee, Cho, and Hofmann 2016) and hybrid models (Luong and Manning 2016). For fair comparison, two metrics are used: BLEU (Papineni et al. 2002) and chrF3 (Popović 2015). A simplified chrF3 sketch is given after this table. |
| Researcher Affiliation | Academia | Shenjian Zhao, Department of Computer Science and Engineering, Shanghai Jiao Tong University, sword.york@gmail.com; Zhihua Zhang, Peking University & Beijing Institute of Big Data Research, zhzhang@math.pku.edu.cn |
| Pseudocode | No | The paper describes the architecture and mathematical formulations of the proposed model but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper mentions: 'We use the scripts from Moses to compute the BLEU score. For chrF3, we use the implementation from github: https://github.com/rsennrich/subword-nmt.' These links are to third-party tools used for evaluation, not the authors' own source code for their proposed model or methodology. There is no statement indicating the release of their own implementation code. |
| Open Datasets | Yes | We use the parallel corpora from WMT. When comparing with RNNsearch on En-Fr task, we reduce the size of the combined corpus to have 12.1M sentence pairs for fairness. When comparing with GNMT, we use the whole dataset which contains 36M parallel sentences. For En-Cs, we use all parallel corpora available for WMT 15. The URL http://www.statmt.org/wmt15/translation-task.html is also provided. |
| Dataset Splits | Yes | We use newstest2013 as the development set and evaluate the models on newstest2014 and newstest2015 for the En-Fr and En-Cs tasks, respectively. |
| Hardware Specification | Yes | We train each shallow model for approximately 2 weeks on a single Titan X GPU. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer, Moses scripts, and subword-nmt implementation, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We use the ADAM optimizer (Kingma and Ba 2015) with minibatches of 100 sentences to train each model. The learning rate is first set to 5e-4 and then halved every epoch. The norm of the gradient is clipped with a threshold of 1. The beam width is set to 12 for all models. A hyperparameter sketch based on these settings is shown after this table. |
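
The optimization settings quoted in the Experiment Setup row can be mirrored in a minimal PyTorch sketch. This is not the authors' code (none is released); the model, data, and loss below are placeholder assumptions, and only the ADAM learning rate of 5e-4 halved every epoch, the gradient-norm clipping at 1.0, and the minibatch size of 100 reflect the paper.

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the (unreleased) attention-via-attention model;
# only the optimization hyperparameters below follow the paper.
model = nn.Linear(16, 8)
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)                 # ADAM, initial lr 5e-4
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # halve lr every epoch

num_epochs, batches_per_epoch, batch_size = 3, 10, 100                    # 100 sentences per minibatch
for epoch in range(num_epochs):
    for _ in range(batches_per_epoch):
        x = torch.randn(batch_size, 16)                                   # dummy inputs
        y = torch.randint(0, 8, (batch_size,))                            # dummy targets
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm at 1
        optimizer.step()
    scheduler.step()                                                      # learning rate halved each epoch
```

The reported beam width of 12 applies to decoding, which is not part of this training sketch.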
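
For the chrF3 metric cited in the Research Type row, the following is a simplified sentence-level implementation in the spirit of Popović (2015): character n-gram precision and recall (n = 1 to 6) are averaged and combined with an F-score that weights recall three times as much as precision (beta = 3). The function names and the whitespace handling are illustrative assumptions; the paper itself relies on a third-party implementation.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (whitespace removed; an assumption of this sketch)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified sentence-level chrF: character n-gram precision/recall averaged
    over n = 1..max_n, combined with an F-beta score (beta=3 gives chrF3)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        matches = sum((hyp & ref).values())          # clipped n-gram matches
        if sum(hyp.values()) and sum(ref.values()):
            precisions.append(matches / sum(hyp.values()))
            recalls.append(matches / sum(ref.values()))
    if not precisions:
        return 0.0
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if p + r else 0.0

print(round(chrf("the cat sat on the mat", "the cat is on the mat"), 3))
```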