Neural Machine Translation by Jointly Learning to Align and Translate
Authors: Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio
ICLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on the task of English-to-French translation. We use the bilingual, parallel corpora provided by ACL WMT '14. As a comparison, we also report the performance of an RNN Encoder-Decoder which was proposed recently by Cho et al. (2014a). |
| Researcher Affiliation | Academia | Dzmitry Bahdanau, Jacobs University Bremen, Germany; Kyunghyun Cho and Yoshua Bengio, Université de Montréal |
| Pseudocode | No | The paper describes the model architecture and training procedure using mathematical equations and descriptive text, but no explicit 'Pseudocode' or 'Algorithm' block is provided. |
| Open Source Code | Yes | Implementations are available at https://github.com/lisa-groundhog/GroundHog. |
| Open Datasets | Yes | We use the bilingual, parallel corpora provided by ACL WMT '14 (http://www.statmt.org/wmt14/translation-task.html). |
| Dataset Splits | Yes | We concatenate news-test-2012 and news-test-2013 to make a development (validation) set, and evaluate the models on the test set (news-test-2014) from WMT 14, which consists of 3003 sentences not present in the training data. |
| Hardware Specification | Yes | NVIDIA TITAN BLACK and Quadro K-6000 GPUs (from Table 2) |
| Software Dependencies | No | The paper mentions software like Theano, Adadelta, and Moses for tokenization, but specific version numbers for these dependencies are not provided. |
| Experiment Setup | Yes | The encoder and decoder of the RNNencdec have 1000 hidden units each. ... We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model. Each SGD update direction is computed using a minibatch of 80 sentences. (Appendix A.2.3 and B.2 give further details: 1000 hidden units, word-embedding dimensionality 620, maxout hidden-layer size 500, Adadelta parameters ϵ = 10^-6 and ρ = 0.95, and a gradient-norm threshold of 1.) |
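To make the reported optimizer settings concrete, the following is a minimal NumPy sketch of the Adadelta update rule (Zeiler, 2012) with the paper's stated hyperparameters (ρ = 0.95, ϵ = 10^-6), plus gradient-norm rescaling at threshold 1. This is an illustrative reconstruction from the published formulas, not the authors' GroundHog code; the function names and the per-parameter state tuple are our own.

```python
import numpy as np

def clip_grad_norm(grad, threshold=1.0):
    """Rescale the gradient so its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta step. `state` holds running averages of
    squared gradients and squared updates (Eg2, Edx2)."""
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2          # accumulate grad^2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # scaled step
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2          # accumulate dx^2
    return param + dx, (Eg2, Edx2)

# Per-minibatch usage: clip the aggregated gradient, then update.
param = np.zeros(3)
state = (np.zeros(3), np.zeros(3))
grad = clip_grad_norm(np.array([0.6, -0.8, 0.0]))
param, state = adadelta_update(param, grad, state)
```

Adadelta's appeal here, as the paper notes implicitly by reporting only ρ and ϵ, is that it requires no hand-tuned learning-rate schedule.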