Sparse Sinkhorn Attention

Authors: Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

ICML 2020

Reproducibility variables, results, and LLM responses:
Research Type: Experimental. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers.
Researcher Affiliation: Industry. Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan (Google Research). Correspondence to: Yi Tay <yitay@google.com>.
Pseudocode: No. The paper describes the methods using mathematical formulations and descriptive text, but does not include a dedicated pseudocode block or algorithm figure.
Open Source Code: No. The paper mentions using the 'open source Tensor2Tensor framework (Vaswani et al., 2018)' but does not provide a link or an explicit statement about releasing the authors' own source code for the proposed Sparse Sinkhorn Attention method.
Open Datasets: Yes. We evaluate on the LM1B (Language Modeling One Billion) dataset (Chelba et al., 2013), ... We use the CIFAR-10 dataset. ... IMDb sentiment (Maas et al., 2011) and Sentiment Treebank (SST) dataset (Socher et al., 2013). ... Stanford NLI (Bowman et al., 2015) and Multi NLI (Williams et al., 2017).
Dataset Splits: No. For the algorithmic sort problem, the paper states: 'The dataset consists of 100K train examples and 1000 test examples,' but no explicit validation split is mentioned. For the other tasks, the paper refers to using the 'default Tensor2Tensor hyperparameters' or the framework's defaults without explicitly detailing validation split percentages or sample counts.
Hardware Specification: Yes. All models are trained for 300K steps on 16 TPU V2 Chips.
Software Dependencies: No. The paper states: 'All our experiments are run on the open source Tensor2Tensor framework (Vaswani et al., 2018),' but no specific version numbers for Tensor2Tensor or other software dependencies are provided.
Experiment Setup: Yes. Our Sinkhorn Transformers adopt the following global hyperparameters: temperature τ tuned among {0.25, 0.50, 0.75, 1.0} and number of sort iterations tuned among {2, 5, 10, 20}. We train all models for 200k steps using the default Transformer base hyperparameters. We train our models for 15000 steps for IMDb and SST and 500000 steps for NLI tasks. For all experiments, we use a batch size of 4096 tokens per batch.
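
The Pseudocode row notes that the paper provides no algorithm block, and the Experiment Setup row lists a temperature τ and a number of sort iterations as the key tuned hyperparameters. For orientation only, the sketch below shows the generic log-space Sinkhorn balancing operator (iterative row and column normalization of exp(scores / τ)) that these two hyperparameters control. It is a minimal NumPy illustration, not the authors' Tensor2Tensor implementation; the function name log_sinkhorn, the 8-bucket example, and the random scores are assumptions made for the example.

```python
import numpy as np
from scipy.special import logsumexp

def log_sinkhorn(scores, tau=0.75, n_iters=10):
    """Turn raw block-to-block scores into an approximately doubly-stochastic
    (soft permutation) matrix via log-space Sinkhorn balancing.

    scores  : (B, B) array of relevance scores between B blocks/buckets.
    tau     : temperature; lower values sharpen towards a hard permutation.
    n_iters : number of row/column normalization ("sort") iterations.
    """
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1, keepdims=True)  # rows sum to 1
        log_p = log_p - logsumexp(log_p, axis=0, keepdims=True)  # columns sum to 1
    return np.exp(log_p)

# Illustrative sweep over the tuning grid reported in the Experiment Setup row.
rng = np.random.default_rng(0)
block_scores = rng.normal(size=(8, 8))  # hypothetical 8-bucket example
for tau in (0.25, 0.50, 0.75, 1.0):
    for n_iters in (2, 5, 10, 20):
        perm = log_sinkhorn(block_scores, tau=tau, n_iters=n_iters)
        assert np.allclose(perm.sum(axis=0), 1.0)  # last step column-normalizes
```

In the paper, a matrix of this form is produced by a learned sorting network over block representations and used to re-sort blocks of keys and values before local attention; the sketch above covers only the balancing step itself.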