Can Active Memory Replace Attention?
Authors: Łukasz Kaiser, Samy Bengio
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As our main test, we train the models discussed above and a baseline attention model on the WMT'14 English-French translation task. The final BLEU scores and per-word perplexities for these different models are presented in Table 1. In addition to the main large-scale translation task, we tested the Extended Neural GPU on English constituency parsing... |
| Researcher Affiliation | Industry | Łukasz Kaiser Google Brain lukaszkaiser@google.com Samy Bengio Google Brain bengio@google.com |
| Pseudocode | No | The paper describes algorithms using mathematical equations and diagrams, but it does not provide formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our model was implemented using TensorFlow [26]. Its code is available as open-source at https://github.com/tensorflow/models/tree/master/neural_gpu/. |
| Open Datasets | Yes | As our main test, we train the models discussed above and a baseline attention model on the WMT'14 English-French translation task. This is the same task that was used to introduce attention [5]... We train without any data filtering on the WMT'14 corpus and test on the WMT'14 test set (newstest'14). We only used the standard WSJ dataset for training. |
| Dataset Splits | Yes | We train without any data filtering on the WMT'14 corpus and test on the WMT'14 test set (newstest'14). The parameters for length normalization and coverage penalty are tuned on the development set (newstest'13). |
| Hardware Specification | No | The paper mentions training times but does not specify the type or model of hardware (e.g., GPU, CPU, TPU) used for the experiments. |
| Software Dependencies | No | The paper states 'Our model was implemented using TensorFlow [26]' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | For the results presented in this paper we used the Adam optimizer [25] with ε = 10⁻⁴ and gradients norm clipped to 1. The number of layers was set to l = 2, the width of the state tensors was constant at w = 4, the number of maps was m = 512, and the convolution kernels width and height was always kw = kh = 3. We trained the Extended Neural GPU with the same settings as above, only with m = 256 (instead of m = 512) and dropout of 30% in each step. |
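The hyperparameters quoted in the "Experiment Setup" row map onto standard optimizer and model-size settings. The sketch below is a minimal, hypothetical illustration of those reported values in modern TensorFlow; the variable names, the config dictionaries, and the use of the Keras optimizer API are assumptions for illustration, not the authors' released code (which predates this API).

```python
# Hypothetical sketch of the reported hyperparameters; names and structure are
# illustrative assumptions, not the authors' released configuration.
import tensorflow as tf

# Adam with epsilon = 1e-4 and gradient norms clipped to 1, as quoted above.
optimizer = tf.keras.optimizers.Adam(epsilon=1e-4, clipnorm=1.0)

# Model-size settings from the "Experiment Setup" row.
neural_gpu_config = {
    "num_layers": 2,        # l = 2
    "state_width": 4,       # w = 4
    "num_maps": 512,        # m = 512
    "kernel_size": (3, 3),  # kw = kh = 3
}

# The Extended Neural GPU reportedly reused these settings, but with
# m = 256 and 30% dropout applied in each step.
extended_neural_gpu_config = {
    **neural_gpu_config,
    "num_maps": 256,
    "dropout": 0.3,
}
```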