Sparse Sinkhorn Attention
Authors: Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers. |
| Researcher Affiliation | Industry | Yi Tay¹, Dara Bahri¹, Liu Yang¹, Donald Metzler¹, Da-Cheng Juan¹; ¹Google Research. Correspondence to: Yi Tay <yitay@google.com>. |
| Pseudocode | No | The paper describes the methods using mathematical formulations and descriptive text, but does not include a dedicated pseudocode block or algorithm figure. |
| Open Source Code | No | The paper mentions using the 'open source Tensor2Tensor framework (Vaswani et al., 2018)' but does not provide a link or explicit statement about the release of their own source code for the proposed Sparse Sinkhorn Attention method. |
| Open Datasets | Yes | We evaluate on the LM1B (Language Modeling One Billion) dataset (Chelba et al., 2013), ... We use the CIFAR-10 dataset. ... IMDb sentiment (Maas et al., 2011) and Sentiment Treebank (SST) dataset (Socher et al., 2013). ... Stanford NLI (Bowman et al., 2015) and Multi NLI (Williams et al., 2017). |
| Dataset Splits | No | For the algorithmic sort problem, the paper states: 'The dataset consists of 100K train examples and 1000 test examples,' but no explicit validation split is mentioned. For other tasks, the paper refers to using the 'default Tensor2Tensor hyperparameters' or framework without explicitly detailing the validation split percentages or sample counts. |
| Hardware Specification | Yes | All models are trained for 300K steps on 16 TPU V2 Chips. |
| Software Dependencies | No | The paper states: 'All our experiments are run on the open source Tensor2Tensor framework (Vaswani et al., 2018),' but no specific version numbers for Tensor2Tensor or other software dependencies are provided. |
| Experiment Setup | Yes | Our Sinkhorn Transformers adopt the following global hyperparameters: temperature τ tuned among {0.25, 0.50, 0.75, 1.0}; number of sort iterations tuned among {2, 5, 10, 20}. We train all models for 200k steps using the default Transformer base hyperparameter. We train our models for 15000 steps for IMDb and SST and 500000 steps for NLI tasks. For all experiments, we use a batch size of 4096 tokens per batch. |
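
The Pseudocode row notes that the method is presented only through mathematical formulations, and the Experiment Setup row names the two global hyperparameters the authors tune (temperature τ and the number of sort iterations). For reference, below is a minimal NumPy sketch of generic log-space Sinkhorn (row/column) normalization, the balancing step that those two hyperparameters control. The function name `log_sinkhorn_normalization`, the default argument values, and the 8-block toy example are illustrative assumptions, not the authors' Tensor2Tensor implementation.

```python
import numpy as np


def logsumexp(x, axis, keepdims=False):
    """Numerically stable log-sum-exp along the given axis."""
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)


def log_sinkhorn_normalization(scores, tau=0.75, n_iters=5):
    """Drive a raw block-to-block score matrix toward a doubly stochastic
    matrix by alternating row and column normalization in log space.

    scores  : (n_blocks, n_blocks) array of raw sorting scores.
    tau     : temperature; the paper tunes it over {0.25, 0.50, 0.75, 1.0}.
    n_iters : number of sort (Sinkhorn) iterations; tuned over {2, 5, 10, 20}.
    """
    log_p = scores / tau
    for _ in range(n_iters):
        # Row normalization: each row sums to 1 in probability space.
        log_p = log_p - logsumexp(log_p, axis=1, keepdims=True)
        # Column normalization: each column sums to 1.
        log_p = log_p - logsumexp(log_p, axis=0, keepdims=True)
    return np.exp(log_p)


# Toy usage: 8 blocks with random scores (hypothetical numbers).
rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(8, 8))
P = log_sinkhorn_normalization(raw_scores, tau=0.75, n_iters=10)
print(P.sum(axis=0))  # each column sum is close to 1
print(P.sum(axis=1))  # each row sum is close to 1
```

With enough iterations the returned matrix is approximately doubly stochastic, which is what allows it to act as a relaxed (soft) block permutation; the temperature controls how sharp that relaxation is.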