Sparse and Continuous Attention Mechanisms

Authors: André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar, Mario Figueiredo

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a proof of concept, we apply our models with continuous attention to text classification, machine translation, and visual question answering tasks, with encouraging results (§4). ... §4 Experiments
Researcher Affiliation | Collaboration | Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal; Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisbon, Portugal; LUMLIS (Lisbon ELLIS Unit), Lisbon, Portugal; Informatics Institute, University of Amsterdam, The Netherlands; Unbabel, Lisbon, Portugal
Pseudocode | Yes | Algorithm 1: Continuous softmax attention with S = R^D, Ω = Ω1, and Gaussian RBFs. (A sketch of this computation is given after the table.)
Open Source Code | Yes | Software code is available at https://github.com/deep-spin/mcan-vqa-continuous-attention.
Open Datasets | Yes | We use the IMDB movie review dataset [29], whose inputs are documents (280 words on average) and outputs are sentiment labels (positive/negative). ... We use the De→En IWSLT 2017 dataset [30] ... using the VQA-v2 dataset [31]
Dataset Splits | No | The paper mentions using specific datasets (IMDB, IWSLT 2017, VQA-v2) and refers to 'test-dev' and 'test-standard' splits for VQA-v2. However, it does not explicitly provide the percentages or counts for training, validation, or test splits for any of the datasets, nor does it refer to a specific predefined split with a citation that details these proportions.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers, such as PyTorch or TensorFlow versions.
Experiment Setup | Yes | For the continuous attention models, we normalize the document length L into the unit interval [0, 1], and use f(t) = −(t − µ)²/(2σ²) as the score function, leading to a 1D Gaussian (α = 1) or truncated parabola (α = 2) as the attention density. We compare three attention variants: discrete attention with softmax [14] and sparsemax [6]; continuous attention, where a CNN and max-pooling yield a document representation v from which we compute µ = sigmoid(w₁ᵀv) and σ² = softplus(w₂ᵀv); ... with 30 Gaussian RBFs and µ linearly spaced in [0, 1] and σ ∈ {.03, .1, .3}. ... We use N = 100 Gaussian RBFs, with µ linearly spaced in [0, 1]² and Σ = 0.001·I. (The second sketch after the table illustrates how µ and σ² are computed.)
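
The pseudocode row above refers to Algorithm 1: continuous softmax attention over S = R^D with Gaussian RBFs. The following NumPy sketch illustrates the idea in the 1-D setting used for text classification; it assumes the value function V(t) = B ψ(t) is fit by multivariate ridge regression over the RBF basis and that the attention density is the Gaussian N(t; µ, σ²) produced by continuous softmax. Function names, the position grid, and the ridge coefficient are illustrative choices, not the authors' released implementation.

```python
import numpy as np

# Minimal 1-D sketch of continuous softmax attention with Gaussian RBFs.
# Assumptions (not from the authors' code): token value vectors H of shape (L, D),
# positions normalised to t_l = l / L, an attention density p(t) = N(t; mu, sigma2)
# already predicted by the model, and a ridge coefficient of 1e-3.

def gaussian_rbf(t, centers, widths):
    """Evaluate the Gaussian RBF basis psi_j(t) = exp(-(t - c_j)^2 / (2 w_j^2))."""
    t = np.atleast_1d(t)[:, None]                          # (L, 1)
    return np.exp(-0.5 * ((t - centers) / widths) ** 2)    # (L, N)

def continuous_softmax_context(H, mu, sigma2, centers, widths, ridge=1e-3):
    """Return the context vector c = B * E_p[psi(t)] for p = N(mu, sigma2)."""
    L, D = H.shape
    t = np.arange(1, L + 1) / L                            # positions in (0, 1]
    F = gaussian_rbf(t, centers, widths)                   # (L, N) basis design matrix
    # Fit the value function V(t) = B psi(t) by multivariate ridge regression:
    # B = H^T F (F^T F + ridge * I)^{-1}
    N = F.shape[1]
    B = H.T @ F @ np.linalg.inv(F.T @ F + ridge * np.eye(N))   # (D, N)
    # E_p[psi_j(t)] has a closed form for a Gaussian density and Gaussian RBFs:
    # w_j / sqrt(sigma2 + w_j^2) * exp(-(mu - c_j)^2 / (2 (sigma2 + w_j^2)))
    var = sigma2 + widths ** 2
    r = widths / np.sqrt(var) * np.exp(-0.5 * (mu - centers) ** 2 / var)  # (N,)
    return B @ r                                           # context vector, shape (D,)

# Usage with 30 RBF centres linearly spaced in [0, 1], as in the quoted setup:
# centers = np.linspace(0, 1, 30); widths = np.full(30, 0.1)
# c = continuous_softmax_context(H, mu=0.4, sigma2=0.01, centers=centers, widths=widths)
```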
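The experiment-setup row states that µ and σ² are obtained from a pooled document representation v via a sigmoid and a softplus. A small sketch of that mapping and of the quadratic score f(t) = −(t − µ)²/(2σ²), with placeholder weight vectors w1 and w2:

```python
import numpy as np

# Sketch of the 1-D attention-parameter heads quoted above:
# mu = sigmoid(w1^T v), sigma^2 = softplus(w2^T v).
# w1, w2, and the pooled document vector v are placeholders, not the released model's tensors.

def predict_attention_params(v, w1, w2):
    """Map a pooled document vector v to the mean and variance of the attention density."""
    mu = 1.0 / (1.0 + np.exp(-(w1 @ v)))   # sigmoid keeps mu inside the normalised span [0, 1]
    sigma2 = np.logaddexp(0.0, w2 @ v)     # softplus keeps the variance positive
    return mu, sigma2

def score(t, mu, sigma2):
    """Quadratic score f(t) = -(t - mu)^2 / (2 sigma^2): continuous softmax of this score
    yields a 1-D Gaussian density (alpha = 1); continuous sparsemax yields a truncated
    parabola (alpha = 2)."""
    return -((t - mu) ** 2) / (2.0 * sigma2)
```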