Can Active Memory Replace Attention?
Authors: Łukasz Kaiser, Samy Bengio
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As our main test, we train the models discussed above and a baseline attention model on the WMT'14 English-French translation task. The final BLEU scores and per-word perplexities for these different models are presented in Table 1. In addition to the main large-scale translation task, we tested the Extended Neural GPU on English constituency parsing... |
| Researcher Affiliation | Industry | Łukasz Kaiser Google Brain lukaszkaiser@google.com Samy Bengio Google Brain bengio@google.com |
| Pseudocode | No | The paper describes algorithms using mathematical equations and diagrams, but it does not provide formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our model was implemented using TensorFlow [26]. Its code is available as open-source at https://github.com/tensorflow/models/tree/master/neural_gpu/. |
| Open Datasets | Yes | As our main test, we train the models discussed above and a baseline attention model on the WMT'14 English-French translation task. This is the same task that was used to introduce attention [5]... We train without any data filtering on the WMT'14 corpus and test on the WMT'14 test set (newstest'14). We only used the standard WSJ dataset for training. |
| Dataset Splits | Yes | We train without any data filtering on the WMT'14 corpus and test on the WMT'14 test set (newstest'14). The parameters for length normalization and coverage penalty are tuned on the development set (newstest'13). |
| Hardware Specification | No | The paper mentions training times but does not specify the type or model of hardware (e.g., GPU, CPU, TPU) used for the experiments. |
| Software Dependencies | No | The paper states 'Our model was implemented using TensorFlow [26]' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | For the results presented in this paper we used the Adam optimizer [25] with ε = 10⁻⁴ and gradients norm clipped to 1. The number of layers was set to l = 2, the width of the state tensors was constant at w = 4, the number of maps was m = 512, and the convolution kernels width and height was always kw = kh = 3. We trained the Extended Neural GPU with the same settings as above, only with m = 256 (instead of m = 512) and dropout of 30% in each step. |
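The hyperparameters quoted in the "Experiment Setup" row map onto standard optimizer and model-size settings. The sketch below is a minimal, hypothetical illustration of those reported values in modern TensorFlow; the variable names, the config dictionaries, and the use of the Keras optimizer API are assumptions for illustration, not the authors' released code (which predates this API).

```python
# Hypothetical sketch of the reported hyperparameters; names and structure are
# illustrative assumptions, not the authors' released configuration.
import tensorflow as tf

# Adam with epsilon = 1e-4 and gradient norms clipped to 1, as quoted above.
optimizer = tf.keras.optimizers.Adam(epsilon=1e-4, clipnorm=1.0)

# Model-size settings from the "Experiment Setup" row.
neural_gpu_config = {
    "num_layers": 2,        # l = 2
    "state_width": 4,       # w = 4
    "num_maps": 512,        # m = 512
    "kernel_size": (3, 3),  # kw = kh = 3
}

# The Extended Neural GPU reportedly reused these settings, but with
# m = 256 and 30% dropout applied in each step.
extended_neural_gpu_config = {
    **neural_gpu_config,
    "num_maps": 256,
    "dropout": 0.3,
}
```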