Sparse and Continuous Attention Mechanisms
Authors: André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar, Mario Figueiredo
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a proof of concept, we apply our models with continuous attention to text classification, machine translation, and visual question answering tasks, with encouraging results (§4). ... §4 Experiments |
| Researcher Affiliation | Collaboration | Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal; Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisbon, Portugal; LUMLIS (Lisbon ELLIS Unit), Lisbon, Portugal; Informatics Institute, University of Amsterdam, The Netherlands; Unbabel, Lisbon, Portugal |
| Pseudocode | Yes | Algorithm 1: Continuous softmax attention with S = ℝᴰ, Ω = Ω₁, and Gaussian RBFs. |
| Open Source Code | Yes | Software code is available at https://github.com/deep-spin/mcan-vqa-continuous-attention. |
| Open Datasets | Yes | We use the IMDB movie review dataset [29], whose inputs are documents (280 words on average) and outputs are sentiment labels (positive/negative). ... We use the De→En IWSLT 2017 dataset [30] ... using the VQA-v2 dataset [31] |
| Dataset Splits | No | The paper mentions using specific datasets (IMDB, IWSLT 2017, VQA-v2) and refers to 'test-dev' and 'test-standard' splits for VQA-v2. However, it does not explicitly provide the percentages or counts for training, validation, or test splits for any of the datasets, nor does it refer to a specific predefined split with a citation that details these proportions. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers, such as PyTorch or TensorFlow versions. |
| Experiment Setup | Yes | For the continuous attention models, we normalize the document length L into the unit interval [0, 1], and use f(t) = −(t − µ)²/(2σ²) as the score function, leading to a 1D Gaussian (α = 1) or truncated parabola (α = 2) as the attention density. We compare three attention variants: discrete attention with softmax [14] and sparsemax [6]; continuous attention, where a CNN and max-pooling yield a document representation v from which we compute µ = sigmoid(w₁ᵀv) and σ² = softplus(w₂ᵀv); ... with 30 Gaussian RBFs and µ linearly spaced in [0, 1] and σ ∈ {.03, .1, .3}. ... We use N = 100 Gaussian RBFs, with µ linearly spaced in [0, 1]² and Σ = 0.001·I. |
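The setup row describes the paper's 1D continuous softmax case: a Gaussian attention density p(t) = N(t; µ, σ²) over normalized positions, a value function approximated with Gaussian RBFs fitted by ridge regression, and a closed-form expectation E_p[ψ_j(t)] = N(µ; µ_j, σ² + σ_j²) for the context vector. The sketch below illustrates that pipeline; the function names, ridge penalty, and default RBF width are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (broadcasts)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def continuous_softmax_attention(H, mu, sigma2, n_rbf=30, rbf_sigma=0.1, ridge=1e-3):
    """Sketch of 1D continuous softmax attention with Gaussian RBFs.

    H:      (L, d) hidden states at positions normalized to [0, 1]
    mu:     mean of the attention density (e.g. sigmoid(w1 . v))
    sigma2: variance of the attention density (e.g. softplus(w2 . v))
    Returns the context vector c = E_p[V(t)], shape (d,).
    """
    L, d = H.shape
    t = np.linspace(0.0, 1.0, L)            # normalized positions
    centers = np.linspace(0.0, 1.0, n_rbf)  # RBF means, linearly spaced in [0, 1]

    # Basis matrix: psi_j(t_i) = N(t_i; center_j, rbf_sigma^2), shape (L, N)
    Psi = gaussian_pdf(t[:, None], centers[None, :], rbf_sigma ** 2)

    # Fit value function V(t) = B psi(t) to H by multivariate ridge regression
    B = H.T @ Psi @ np.linalg.inv(Psi.T @ Psi + ridge * np.eye(n_rbf))  # (d, N)

    # Closed form: E_{t~N(mu,sigma2)}[psi_j(t)] = N(mu; center_j, sigma2 + rbf_sigma^2)
    r = gaussian_pdf(mu, centers, sigma2 + rbf_sigma ** 2)  # (N,)

    return B @ r  # context vector, shape (d,)
```

Because both the density and the basis functions are Gaussian, the integral defining the context vector never needs numerical quadrature; the sparse (α = 2, truncated parabola) variant replaces the closed-form expectation `r` with the corresponding truncated-support integral.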