Area Attention

Authors: Yang Li, Lukasz Kaiser, Samy Bengio, Si Si

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate area attention on two tasks: neural machine translation (both character and token-level) and image captioning, and improve upon strong (state-of-the-art) baselines in all the cases.
Researcher Affiliation | Industry | Google Research, Mountain View, CA, USA. Correspondence to: Yang Li <liyang@google.com>.
Pseudocode | Yes | We present the pseudo code for performing Eq. 3, 4 and 5 as well as the shape size of each area in Algorithm 1 and 2. (A hedged sketch of the area computation follows the table.)
Open Source Code | Yes | See TensorFlow implementation of Area Attention as well as its integration with Transformer and LSTM in https://github.com/tensorflow/tensor2tensor.
Open Datasets | Yes | We use the same dataset as the one used in (Vaswani et al., 2017), in which the WMT 2014 English-German (EN-DE) dataset contains about 4.5 million English-German sentence pairs and the English-French (EN-FR) dataset has about 36 million English-French sentence pairs (Wu et al., 2016).
Dataset Splits | Yes | We trained each model based on the training & development sets provided by the COCO dataset (Lin et al., 2014), which has 82K images for training and 40K for validation.
Hardware Specification | Yes | Trained on one machine with 8 NVIDIA P100 GPUs for a total of 250,000 steps.
Software Dependencies | No | The paper mentions a "TensorFlow implementation" but does not specify a version number for TensorFlow or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | Tiny (#hidden layers=2, hidden size=128, filter size=512, #attention heads=4), Small (#hidden layers=2, hidden size=256, filter size=1024, #attention heads=4), Base (#hidden layers=6, hidden size=512, filter size=2048, #attention heads=8) and Big (#hidden layers=6, hidden size=1024, filter size=4096 for EN-DE and 8192 for EN-FR, #attention heads=16). (See the configuration sketch after the area attention example below.)
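To make the pseudocode row concrete, below is a minimal NumPy sketch of basic 1-D area attention as described in the paper: an area is a contiguous span of memory items, its key is the mean of the item keys, its value is the sum of the item values, and standard scaled dot-product attention is then applied over all areas. The function name, the brute-force enumeration of areas, and the single-query, single-head setting are illustrative simplifications, not the authors' tensor2tensor implementation, which computes area features efficiently with summed-area tables and also supports richer area features (e.g., standard deviation and shape embeddings).

```python
import numpy as np

def area_attention_1d(query, keys, values, max_area_width=3):
    """Minimal 1-D area attention sketch (single query, single head).

    Areas are contiguous spans of up to `max_area_width` memory items.
    Each area's key is the mean of its item keys and its value is the
    sum of its item values; softmax attention is then applied over the
    enlarged memory of all areas.
    """
    n, d = keys.shape
    area_keys, area_values = [], []
    for width in range(1, max_area_width + 1):
        for start in range(0, n - width + 1):
            area_keys.append(keys[start:start + width].mean(axis=0))   # area key: mean
            area_values.append(values[start:start + width].sum(axis=0))  # area value: sum
    area_keys = np.stack(area_keys)      # [num_areas, d]
    area_values = np.stack(area_values)  # [num_areas, d_v]

    # Scaled dot-product attention over all areas.
    scores = area_keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ area_values

# Toy usage: 5 memory items with 4-dimensional keys/values, one query.
rng = np.random.default_rng(0)
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
q = rng.normal(size=(4,))
print(area_attention_1d(q, k, v, max_area_width=2).shape)  # (4,)
```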
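For reference, the four Transformer configurations quoted in the Experiment Setup row can be collected into a plain Python dictionary, as sketched below. The field names are illustrative only and are not the tensor2tensor hyperparameter names.

```python
# Illustrative summary of the Tiny/Small/Base/Big configurations quoted above;
# key names are hypothetical, not tensor2tensor hyperparameters.
TRANSFORMER_CONFIGS = {
    "tiny":  {"hidden_layers": 2, "hidden_size": 128,  "filter_size": 512,  "attention_heads": 4},
    "small": {"hidden_layers": 2, "hidden_size": 256,  "filter_size": 1024, "attention_heads": 4},
    "base":  {"hidden_layers": 6, "hidden_size": 512,  "filter_size": 2048, "attention_heads": 8},
    # Big uses filter size 4096 for EN-DE and 8192 for EN-FR.
    "big":   {"hidden_layers": 6, "hidden_size": 1024, "filter_size": 4096, "attention_heads": 16},
}
```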