Neural Machine Translation by Jointly Learning to Align and Translate

Authors: Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio

ICLR 2015

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate the proposed approach on the task of English-to-French translation. We use the bilingual, parallel corpora provided by ACL WMT '14. As a comparison, we also report the performance of an RNN Encoder-Decoder which was proposed recently by Cho et al. (2014a)." |
| Researcher Affiliation | Academia | Dzmitry Bahdanau (Jacobs University Bremen, Germany); Kyunghyun Cho and Yoshua Bengio (Université de Montréal) |
| Pseudocode | No | The paper describes the model architecture and training procedure using mathematical equations and descriptive text, but no explicit "Pseudocode" or "Algorithm" block is provided (an illustrative sketch of the attention computation appears below this table). |
| Open Source Code | Yes | Implementations are available at https://github.com/lisa-groundhog/GroundHog. |
| Open Datasets | Yes | "We use the bilingual, parallel corpora provided by ACL WMT '14." (http://www.statmt.org/wmt14/translation-task.html) |
| Dataset Splits | Yes | "We concatenate news-test-2012 and news-test-2013 to make a development (validation) set, and evaluate the models on the test set (news-test-2014) from WMT '14, which consists of 3003 sentences not present in the training data." |
| Hardware Specification | Yes | NVIDIA TITAN Black and Quadro K-6000 GPUs (from Table 2) |
| Software Dependencies | No | The paper mentions Theano, the Adadelta optimizer, and the Moses tokenizer, but specific version numbers for these dependencies are not provided. |
| Experiment Setup | Yes | "The encoder and decoder of the RNNencdec have 1000 hidden units each. ... We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model. Each SGD update direction is computed using a minibatch of 80 sentences." Further details in Appendices A.2.3 and B.2: 1000 hidden units, 620-dimensional word embeddings, a maxout hidden layer of size 500, Adadelta parameters ε = 10⁻⁶ and ρ = 0.95, and gradient rescaling when the norm exceeds 1 (see the optimizer sketch below this table). |
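
Since the paper supplies equations but no pseudocode, the following is a minimal NumPy sketch of the additive attention it describes: alignment scores e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), weights α_i = softmax(e_i), and context c_i = Σ_j α_ij h_j. The function name, weight shapes, and toy dimensions are illustrative assumptions, not the authors' GroundHog code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive attention.

    s_prev : (n,)    previous decoder state s_{i-1}
    H      : (T, 2n) encoder annotations h_1..h_T (bidirectional, hence 2n)
    Returns the context vector c_i (2n,) and alignment weights alpha (T,).
    """
    # Alignment scores: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a  # (T,)
    alpha = softmax(scores)                              # alignment weights
    c = alpha @ H                                        # expected annotation
    return c, alpha

# Toy usage with arbitrary sizes (n = 4 decoder units, T = 6 source words)
rng = np.random.default_rng(0)
n, T = 4, 6
s_prev = rng.normal(size=n)
H = rng.normal(size=(T, 2 * n))
W_a = rng.normal(size=(n, n))       # alignment-layer width chosen = n here
U_a = rng.normal(size=(n, 2 * n))
v_a = rng.normal(size=n)
c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution over source words
```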
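
The experiment-setup row pins down the optimizer hyperparameters precisely, which is enough to reconstruct the training configuration. Below is a hedged sketch in PyTorch (the original implementation was Theano/GroundHog); the single-GRU stand-in model and the dummy loss are assumptions for illustration, while the Adadelta parameters (ρ = 0.95, ε = 10⁻⁶), the minibatch of 80 sentences, and the gradient-norm threshold of 1 come from the paper.

```python
import torch

# Stand-in model with the paper's dimensions (620-dim embeddings in,
# 1000 hidden units); the real model is the full RNNsearch encoder-decoder.
model = torch.nn.GRU(input_size=620, hidden_size=1000, batch_first=True)
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6)

x = torch.randn(80, 10, 620)   # one minibatch: 80 sequences of length 10
out, _ = model(x)
loss = out.pow(2).mean()       # dummy loss, just to produce gradients
loss.backward()
# Rescale gradients whenever their L2 norm exceeds the threshold of 1
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```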