Online and Linear-Time Attention by Enforcing Monotonic Alignments

Authors: Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate our proposed approach for learning monotonic alignments, we applied it to a variety of sequence-to-sequence problems: sentence summarization, machine translation, and online speech recognition. In the following subsections, we give an overview of the models used and the results we obtained; for more details about hyperparameters and training specifics please see appendix D."
Researcher Affiliation | Industry | "Google Brain, Mountain View, California, USA."
Pseudocode | Yes | "This process is visualized in fig. 2 and is presented more explicitly in algorithm 1 (appendix A). (...) we present our differentiable approach to training the monotonic alignment decoder in algorithm 2 (appendix A)." (A minimal sketch of this recurrence is given below the table.)
Open Source Code | Yes | "To facilitate experimentation with our proposed attention mechanism, we have made an example TensorFlow (Abadi et al., 2016) implementation of our approach available online (https://github.com/craffel/mad) and added a reference implementation to TensorFlow's tf.contrib.seq2seq module."
Open Datasets | Yes | "We tested our approach on two datasets: TIMIT (Garofolo et al., 1993) and the Wall Street Journal corpus (Paul & Baker, 1992). (...) sentence summarization experiment using the Gigaword corpus (...) English to Vietnamese translation using the parallel corpus of TED talks (133K sentence pairs) provided by the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015)."
Dataset Splits | Yes | "We used the standard train/validation/test split and report results on the test set. (...) We used the standard dataset split of si284 for training, dev93 for validation, and eval92 for testing. (...) We use the TED tst2012 (1553 sentences) as a validation set for hyperparameter tuning and TED tst2013 (1268 sentences) as a test set."
Hardware Specification | No | The paper mentions TensorFlow, which can run on GPUs, but it does not specify any particular hardware, such as GPU or CPU models or memory configurations, used for the experiments; it only refers generally to training networks.
Software Dependencies | No | The paper mentions "TensorFlow (Abadi et al., 2016)" and the "Adam optimizer (Kingma & Ba, 2014)", but it does not give version numbers for these or for any other software components, which is required for reproducibility.
Experiment Setup | Yes | "For more details about hyperparameters and training specifics please see appendix D. (...) All networks were trained using standard cross-entropy loss with teacher forcing against target sequences using the Adam optimizer (Kingma & Ba, 2014). (...) Our encoder RNN consisted of three unidirectional LSTM layers. (...) Our decoder RNN was a single unidirectional LSTM. Our output softmax had 62 dimensions (...). At test time, we utilized a beam search over softmax predictions, with a beam width of 10. (...) We utilized label smoothing during training (Chorowski & Jaitly, 2017), replacing the targets at time y_t with a convex weighted combination of the surrounding five labels." (A minimal sketch of this kind of label smoothing is given below the table.)
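The "Pseudocode" row points to Algorithms 1 and 2 in the paper's appendix A. As a rough illustration of the underlying recurrence, not the authors' reference implementation (which lives in the linked repository and in tf.contrib.seq2seq), the NumPy sketch below computes the expected soft monotonic alignment for one output step (training time) and a greedy hard-attention step (test time). The function and variable names (soft_monotonic_attention, p_choose, prev_alpha) are my own.

```python
import numpy as np

def soft_monotonic_attention(p_choose, prev_alpha, eps=1e-10):
    """Expected (soft) monotonic alignment for one output step -- training-time sketch.

    p_choose:   shape (T,), sigmoid selection probabilities p_{i,j} for output step i.
    prev_alpha: shape (T,), alignment alpha_{i-1,:} from the previous output step.
    Computes alpha_{i,j} = p_{i,j} * sum_{k<=j} alpha_{i-1,k} * prod_{l=k}^{j-1} (1 - p_{i,l})
    via cumulative sums/products; the division below can underflow, which is why the
    paper also discusses more numerically stable variants.
    """
    # Exclusive cumulative product: prod_{l<j} (1 - p_{i,l}), with value 1 at j = 0.
    exclusive_cumprod = np.cumprod(np.concatenate(([1.0], 1.0 - p_choose[:-1])))
    return p_choose * exclusive_cumprod * np.cumsum(
        prev_alpha / np.maximum(exclusive_cumprod, eps))


def hard_monotonic_step(p_choose, prev_index):
    """Greedy online (test-time) sketch: scan forward from the previously attended
    index and attend to the first entry whose selection probability exceeds 0.5.
    Returns None if nothing is selected (the paper then uses an all-zero context)."""
    for j in range(prev_index, len(p_choose)):
        if p_choose[j] > 0.5:
            return j
    return None
```

For the first output step, prev_alpha would be a one-hot vector on the first memory entry, matching the paper's initialization of the alignment.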
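The "Experiment Setup" row quotes a label smoothing scheme that replaces the hard target at each time step with a convex combination over the surrounding labels (Chorowski & Jaitly, 2017). The exact weights are not given in the excerpt, so the sketch below is only an illustrative reading of that scheme; neighborhood_smoothed_targets and its default weight vector are assumptions of mine.

```python
import numpy as np

def neighborhood_smoothed_targets(labels, num_classes,
                                  weights=(0.1, 0.2, 0.4, 0.2, 0.1)):
    """Spread each time step's target mass over the labels at neighboring steps.

    labels:  shape (T,) integer target sequence.
    weights: convex weights over the window of surrounding labels; the values
             here are illustrative, not taken from the paper.
    Returns an array of shape (T, num_classes) of soft target distributions
    suitable for use with a cross-entropy loss.
    """
    T = len(labels)
    radius = len(weights) // 2
    targets = np.zeros((T, num_classes))
    for t in range(T):
        total = 0.0
        for offset, w in zip(range(-radius, radius + 1), weights):
            s = t + offset
            if 0 <= s < T:
                targets[t, labels[s]] += w
                total += w
        targets[t] /= total  # renormalize near sequence boundaries
    return targets
```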