From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
Authors: André Martins, Ramón Astudillo
ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus. We next evaluate empirically the ability of sparsemax for addressing two classes of problems: 1. Label proportion estimation and multi-label classification... 2. Attention-based neural networks... We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015)... |
| Researcher Affiliation | Collaboration | André F. T. Martins ANDRE.MARTINS@UNBABEL.COM, Ramón F. Astudillo RAMON@UNBABEL.COM. Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal; Instituto de Telecomunicações (IT), Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal; Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisboa, Portugal |
| Pseudocode | Yes | Algorithm 1 (Sparsemax Evaluation). Input: z. Sort z as z_(1) ≥ … ≥ z_(K). Find k(z) := max{ k ∈ [K] : 1 + k·z_(k) > Σ_{j≤k} z_(j) }. Define τ(z) = (Σ_{j≤k(z)} z_(j) − 1) / k(z). Output: p such that p_i = [z_i − τ(z)]_+. (A NumPy rendering appears below the table.) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described in the paper is openly available. |
| Open Datasets | Yes | We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015)... multi-label classification datasets: the four small-scale datasets used by Koyejo et al. (2015) and the much larger Reuters RCV1-v2 dataset of Lewis et al. (2004). [Footnotes 7 and 8: obtained from http://mulan.sourceforge.net/datasets-mlc.html and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html, respectively.] |
| Dataset Splits | Yes | We used the provided training, development, and test splits. ...tuning the hyperparameters in a heldout validation set (for the Reuters dataset) and with 5-fold cross-validation (for the other four datasets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments. It mentions 'GPU-friendly' but no actual hardware used for the empirical evaluation. |
| Software Dependencies | No | The paper mentions software components and algorithms such as Adam (Kingma & Ba, 2014), GloVe vectors (Pennington et al., 2014), gated recurrent units (GRUs; Cho et al., 2014), and L-BFGS (Liu & Nocedal, 1989; Nesterov, 1983), but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We optimized all the systems with Adam (Kingma & Ba, 2014), using the default parameters β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸, and setting the learning rate to 3×10⁻⁴. We tuned an ℓ2-regularization coefficient in {0, 10⁻⁴, 3×10⁻⁴, 10⁻³} and, as Rocktäschel et al. (2015), a dropout probability of 0.1 in the inputs and outputs of the network. ...for a maximum of 100 epochs... (An illustrative optimizer configuration appears below.) |
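
For reference, Algorithm 1 quoted in the Pseudocode row translates directly into a few lines of NumPy. This is a minimal sketch of the published pseudocode, not the authors' implementation (which, per the Open Source Code row, is not released); the function name `sparsemax` and the NumPy vectorization are our own choices.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D score vector z, following Algorithm 1 of the paper."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # z_(1) >= ... >= z_(K)
    cumsum = np.cumsum(z_sorted)           # running sums over the sorted scores
    ks = np.arange(1, z.size + 1)
    # k(z): the largest k in [K] with 1 + k * z_(k) > sum_{j<=k} z_(j)
    k_z = ks[1 + ks * z_sorted > cumsum][-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z    # threshold tau(z)
    return np.maximum(z - tau, 0.0)        # p_i = [z_i - tau(z)]_+

# Unlike softmax, sparsemax can assign exactly zero probability:
print(sparsemax([1.0, 2.0, 0.1]))         # -> [0. 1. 0.]
```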
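
The Experiment Setup row likewise maps onto a standard optimizer configuration. The paper does not name a framework, so the PyTorch rendering below is purely illustrative of the stated hyperparameters; the model is a placeholder, and `weight_decay` stands in for the tuned ℓ2 coefficient (one grid point shown).

```python
import torch

model = torch.nn.Linear(300, 3)  # placeholder; the paper uses GRU-based attention networks

# Adam with the paper's reported defaults and learning rate.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-4,  # l2 coefficient tuned over {0, 1e-4, 3e-4, 1e-3}
)
dropout = torch.nn.Dropout(p=0.1)  # applied at the network's inputs and outputs
MAX_EPOCHS = 100
```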