Sparse Sinkhorn Attention

Authors: Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

ICML 2020

Reproducibility variables, results, and LLM responses:
Research Type: Experimental. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers.
Researcher Affiliation: Industry. Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan (Google Research). Correspondence to: Yi Tay <yitay@google.com>.
Pseudocode: No. The paper describes the methods using mathematical formulations and descriptive text, but does not include a dedicated pseudocode block or algorithm figure.
Open Source Code: No. The paper mentions using the 'open source Tensor2Tensor framework (Vaswani et al., 2018)' but does not provide a link or an explicit statement about releasing the authors' own source code for the proposed Sparse Sinkhorn Attention method.
Open Datasets: Yes. We evaluate on the LM1B (Language Modeling One Billion) dataset (Chelba et al., 2013), ... We use the CIFAR-10 dataset. ... IMDb sentiment (Maas et al., 2011) and Sentiment Treebank (SST) dataset (Socher et al., 2013). ... Stanford NLI (Bowman et al., 2015) and Multi NLI (Williams et al., 2017).
Dataset Splits: No. For the algorithmic sort problem, the paper states: 'The dataset consists of 100K train examples and 1000 test examples,' but no explicit validation split is mentioned. For the other tasks, the paper refers to using the 'default Tensor2Tensor hyperparameters' or the framework's defaults without explicitly detailing validation split percentages or sample counts.
Hardware Specification: Yes. All models are trained for 300K steps on 16 TPU V2 Chips.
Software Dependencies: No. The paper states: 'All our experiments are run on the open source Tensor2Tensor framework (Vaswani et al., 2018),' but no specific version numbers for Tensor2Tensor or other software dependencies are provided.
Experiment Setup: Yes. Our Sinkhorn Transformers adopt the following global hyperparameters: temperature τ tuned among {0.25, 0.50, 0.75, 1.0} and number of sort iterations tuned among {2, 5, 10, 20}. We train all models for 200k steps using the default Transformer base hyperparameters. We train our models for 15000 steps for IMDb and SST and 500000 steps for NLI tasks. For all experiments, we use a batch size of 4096 tokens per batch.
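
The Pseudocode row notes that the paper provides no algorithm block, and the Experiment Setup row lists a temperature τ and a number of sort iterations as the key tuned hyperparameters. For orientation only, the sketch below shows the generic log-space Sinkhorn balancing operator (iterative row and column normalization of exp(scores / τ)) that these two hyperparameters control. It is a minimal NumPy illustration, not the authors' Tensor2Tensor implementation; the function name log_sinkhorn, the 8-bucket example, and the random scores are assumptions made for the example.

```python
import numpy as np
from scipy.special import logsumexp

def log_sinkhorn(scores, tau=0.75, n_iters=10):
    """Turn raw block-to-block scores into an approximately doubly-stochastic
    (soft permutation) matrix via log-space Sinkhorn balancing.

    scores  : (B, B) array of relevance scores between B blocks/buckets.
    tau     : temperature; lower values sharpen towards a hard permutation.
    n_iters : number of row/column normalization ("sort") iterations.
    """
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1, keepdims=True)  # rows sum to 1
        log_p = log_p - logsumexp(log_p, axis=0, keepdims=True)  # columns sum to 1
    return np.exp(log_p)

# Illustrative sweep over the tuning grid reported in the Experiment Setup row.
rng = np.random.default_rng(0)
block_scores = rng.normal(size=(8, 8))  # hypothetical 8-bucket example
for tau in (0.25, 0.50, 0.75, 1.0):
    for n_iters in (2, 5, 10, 20):
        perm = log_sinkhorn(block_scores, tau=tau, n_iters=n_iters)
        assert np.allclose(perm.sum(axis=0), 1.0)  # last step column-normalizes
```

In the paper, a matrix of this form is produced by a learned sorting network over block representations and used to re-sort blocks of keys and values before local attention; the sketch above covers only the balancing step itself.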