From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Authors: André Martins, Ramón Astudillo

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus. We next evaluate empirically the ability of sparsemax for addressing two classes of problems: 1. Label proportion estimation and multi-label classification... 2. Attention-based neural networks... We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015)...
Researcher Affiliation | Collaboration | André F. T. Martins (ANDRE.MARTINS@UNBABEL.COM), Ramón F. Astudillo (RAMON@UNBABEL.COM). Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal; Instituto de Telecomunicações (IT), Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal; Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
Pseudocode | Yes | Algorithm 1 (Sparsemax Evaluation): Input: z. Sort z as z_(1) ≥ … ≥ z_(K). Find k(z) := max{k ∈ [K] | 1 + k·z_(k) > Σ_{j≤k} z_(j)}. Define τ(z) = (Σ_{j≤k(z)} z_(j) − 1) / k(z). Output: p such that p_i = [z_i − τ(z)]_+. (A runnable sketch of this algorithm follows the table.)
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in the paper is openly available.
Open Datasets | Yes | We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015)... multi-label classification datasets: the four small-scale datasets used by Koyejo et al. (2015), obtained from http://mulan.sourceforge.net/datasets-mlc.html, and the much larger Reuters RCV1-v2 dataset of Lewis et al. (2004), obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html.
Dataset Splits | Yes | We used the provided training, development, and test splits. ...tuning the hyperparameters in a heldout validation set (for the Reuters dataset) and with 5-fold cross-validation (for the other four datasets).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments. It mentions "GPU-friendly" but does not name the actual hardware used for the empirical evaluation.
Software Dependencies | No | The paper mentions software components and algorithms such as Adam (Kingma & Ba, 2014), GloVe vectors (Pennington et al., 2014), gated recurrent units (GRUs; Cho et al., 2014), and L-BFGS (Liu & Nocedal, 1989; Nesterov, 1983), but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | We optimized all the systems with Adam (Kingma & Ba, 2014), using the default parameters β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸, and setting the learning rate to 3 × 10⁻⁴. We tuned a ℓ2-regularization coefficient in {0, 10⁻⁴, 3 × 10⁻⁴, 10⁻³} and, as Rocktäschel et al. (2015), a dropout probability of 0.1 in the inputs and outputs of the network. ...for a maximum of 100 epochs... (An optimizer-configuration sketch follows the table.)
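
Below is a minimal NumPy sketch of Algorithm 1 as reproduced in the Pseudocode row above. The function name and array handling are our own; the paper gives only pseudocode, so treat this as an illustration rather than the authors' implementation.

```python
import numpy as np

def sparsemax(z):
    """Evaluate sparsemax(z): the Euclidean projection of the score
    vector z onto the probability simplex (Algorithm 1 of the paper)."""
    z = np.asarray(z, dtype=float)
    # Sort scores in descending order: z_(1) >= ... >= z_(K).
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)           # partial sums of sorted scores
    k = np.arange(1, z.size + 1)
    # k(z) := max{ k in [K] : 1 + k * z_(k) > sum_{j<=k} z_(j) }.
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    # tau(z) = (sum_{j<=k(z)} z_(j) - 1) / k(z).
    tau = (cumsum[k_z - 1] - 1.0) / k_z
    # p_i = [z_i - tau(z)]_+ ; scores below the threshold become exact zeros.
    return np.maximum(z - tau, 0.0)
```

For example, sparsemax([1.2, 0.9, 0.1]) returns [0.65, 0.35, 0.0]: the output still sums to one, but the weakest score is truncated to an exact zero, which is the "selective, more compact" attention behavior the abstract describes (softmax would keep all three entries strictly positive).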
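
The Experiment Setup row translates directly into an optimizer configuration. The sketch below uses PyTorch's torch.optim.Adam; the framework choice and the model stub are assumptions (the paper does not state which library was used), and weight_decay is shown at one point of the tuned grid.

```python
import torch

# Stand-in module; the paper's actual architecture (GRU encoders with
# softmax/sparsemax attention) is not reproduced here.
model = torch.nn.Linear(300, 3)

# Settings quoted in the row above: Adam with beta1 = 0.9,
# beta2 = 0.999, eps = 1e-8, and learning rate 3e-4.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=3e-4,  # one point of the tuned grid {0, 1e-4, 3e-4, 1e-3}
)

# Dropout probability 0.1 on the inputs and outputs of the network,
# following Rocktäschel et al. (2015), as the paper reports.
dropout = torch.nn.Dropout(p=0.1)
```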