Sparse and Continuous Attention Mechanisms

Authors: André Martins, António Farinhas, Marcos Treviso, Vlad Niculae, Pedro Aguiar, Mario Figueiredo

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a proof of concept, we apply our models with continuous attention to text classification, machine translation, and visual question answering tasks, with encouraging results (§4). ... §4 Experiments
Researcher Affiliation | Collaboration | Instituto de Telecomunicações, Instituto Superior Técnico, Lisbon, Portugal; Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisbon, Portugal; LUMLIS (Lisbon ELLIS Unit), Lisbon, Portugal; Informatics Institute, University of Amsterdam, The Netherlands; Unbabel, Lisbon, Portugal
Pseudocode | Yes | Algorithm 1: Continuous softmax attention with S = R^D, Ω = Ω1, and Gaussian RBFs. (A sketch of this computation is given after the table.)
Open Source Code | Yes | Software code is available at https://github.com/deep-spin/mcan-vqa-continuous-attention.
Open Datasets | Yes | We use the IMDB movie review dataset [29], whose inputs are documents (280 words on average) and outputs are sentiment labels (positive/negative). ... We use the De→En IWSLT 2017 dataset [30] ... using the VQA-v2 dataset [31]
Dataset Splits | No | The paper mentions using specific datasets (IMDB, IWSLT 2017, VQA-v2) and refers to 'test-dev' and 'test-standard' splits for VQA-v2. However, it does not explicitly provide the percentages or counts for training, validation, or test splits for any of the datasets, nor does it refer to a specific predefined split with a citation that details these proportions.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers, such as PyTorch or TensorFlow versions.
Experiment Setup | Yes | For the continuous attention models, we normalize the document length L into the unit interval [0, 1], and use f(t) = −(t − µ)²/(2σ²) as the score function, leading to a 1D Gaussian (α = 1) or truncated parabola (α = 2) as the attention density. We compare three attention variants: discrete attention with softmax [14] and sparsemax [6]; continuous attention, where a CNN and max-pooling yield a document representation v from which we compute µ = sigmoid(w₁ᵀv) and σ² = softplus(w₂ᵀv); ... with 30 Gaussian RBFs and µ linearly spaced in [0, 1] and σ ∈ {.03, .1, .3}. ... We use N = 100 Gaussian RBFs, with µ linearly spaced in [0, 1]² and Σ = 0.001·I. (The second sketch after the table illustrates how µ and σ² are computed.)
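
The pseudocode row above refers to Algorithm 1: continuous softmax attention over S = R^D with Gaussian RBFs. The following NumPy sketch illustrates the idea in the 1-D setting used for text classification; it assumes the value function V(t) = B ψ(t) is fit by multivariate ridge regression over the RBF basis and that the attention density is the Gaussian N(t; µ, σ²) produced by continuous softmax. Function names, the position grid, and the ridge coefficient are illustrative choices, not the authors' released implementation.

```python
import numpy as np

# Minimal 1-D sketch of continuous softmax attention with Gaussian RBFs.
# Assumptions (not from the authors' code): token value vectors H of shape (L, D),
# positions normalised to t_l = l / L, an attention density p(t) = N(t; mu, sigma2)
# already predicted by the model, and a ridge coefficient of 1e-3.

def gaussian_rbf(t, centers, widths):
    """Evaluate the Gaussian RBF basis psi_j(t) = exp(-(t - c_j)^2 / (2 w_j^2))."""
    t = np.atleast_1d(t)[:, None]                          # (L, 1)
    return np.exp(-0.5 * ((t - centers) / widths) ** 2)    # (L, N)

def continuous_softmax_context(H, mu, sigma2, centers, widths, ridge=1e-3):
    """Return the context vector c = B * E_p[psi(t)] for p = N(mu, sigma2)."""
    L, D = H.shape
    t = np.arange(1, L + 1) / L                            # positions in (0, 1]
    F = gaussian_rbf(t, centers, widths)                   # (L, N) basis design matrix
    # Fit the value function V(t) = B psi(t) by multivariate ridge regression:
    # B = H^T F (F^T F + ridge * I)^{-1}
    N = F.shape[1]
    B = H.T @ F @ np.linalg.inv(F.T @ F + ridge * np.eye(N))   # (D, N)
    # E_p[psi_j(t)] has a closed form for a Gaussian density and Gaussian RBFs:
    # w_j / sqrt(sigma2 + w_j^2) * exp(-(mu - c_j)^2 / (2 (sigma2 + w_j^2)))
    var = sigma2 + widths ** 2
    r = widths / np.sqrt(var) * np.exp(-0.5 * (mu - centers) ** 2 / var)  # (N,)
    return B @ r                                           # context vector, shape (D,)

# Usage with 30 RBF centres linearly spaced in [0, 1], as in the quoted setup:
# centers = np.linspace(0, 1, 30); widths = np.full(30, 0.1)
# c = continuous_softmax_context(H, mu=0.4, sigma2=0.01, centers=centers, widths=widths)
```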
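The experiment-setup row states that µ and σ² are obtained from a pooled document representation v via a sigmoid and a softplus. A small sketch of that mapping and of the quadratic score f(t) = −(t − µ)²/(2σ²), with placeholder weight vectors w1 and w2:

```python
import numpy as np

# Sketch of the 1-D attention-parameter heads quoted above:
# mu = sigmoid(w1^T v), sigma^2 = softplus(w2^T v).
# w1, w2, and the pooled document vector v are placeholders, not the released model's tensors.

def predict_attention_params(v, w1, w2):
    """Map a pooled document vector v to the mean and variance of the attention density."""
    mu = 1.0 / (1.0 + np.exp(-(w1 @ v)))   # sigmoid keeps mu inside the normalised span [0, 1]
    sigma2 = np.logaddexp(0.0, w2 @ v)     # softplus keeps the variance positive
    return mu, sigma2

def score(t, mu, sigma2):
    """Quadratic score f(t) = -(t - mu)^2 / (2 sigma^2): continuous softmax of this score
    yields a 1-D Gaussian density (alpha = 1); continuous sparsemax yields a truncated
    parabola (alpha = 2)."""
    return -((t - mu) ** 2) / (2.0 * sigma2)
```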