Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method

Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on Long Range Arena benchmark show that the proposed method is sufficient in getting comparable or even better performance than the full self-attention while requiring fewer computation resources."
Researcher Affiliation | Academia | "University of Illinois Urbana-Champaign {yifanc10, qizeng2, hengji, yy84}@illinois.edu"
Pseudocode | No | The paper describes mathematical derivations and steps of the method but does not include a formally labeled pseudocode or algorithm block.
Open Source Code | Yes | "Our code is released at https://github.com/pkuzengqi/Skyformer"
Open Datasets | Yes | "We evaluate the proposed methods on five classification tasks on the LRA benchmark [Tay et al., 2020b], which focuses on model quality under long-context scenarios: ListOps [Nangia and Bowman, 2018], Text Classification on the IMDb review dataset [Maas et al., 2011], Document Retrieval on the AAN dataset [Radev et al., 2013], Pathfinder [Linsley et al., 2018], and Image Classification on CIFAR-10 [Krizhevsky et al., 2009]."
Dataset Splits | Yes | "Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation."
Hardware Specification | Yes | "We conduct each experiment on one Tesla V100 SXM2 16GB."
Software Dependencies | No | The paper mentions using "PyTorch" and "Huggingface's implementation", but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "We use a 2-layer transformer model with 64 embedding dimension, 128 hidden dimension, 2 attention heads, and mean pooling for classification. Batch size is selected conditioned on the memory requirements of the standard self-attention method, which leads to 16 for Text Classification, 32 for ListOps, 16 for Document Retrieval, 128 for Pathfinder, and 256 for Image Classification. Learning rate is set to 1e-4 for Text Classification, ListOps, and Image Classification, and 2e-4 for Retrieval and Pathfinder. Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation. For comparable computation complexity, we control the number of features used in all methods (except BigBird) to be 128, under which setting the models will visit 128n elements in the attention matrix."
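For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a small Python configuration sketch. This is illustrative only: the names MODEL_CONFIG, TASK_CONFIGS, and TRAIN_STEPS are hypothetical and simply restate the values above; they are not taken from the released Skyformer code.

```python
# Hypothetical configuration sketch; values mirror the Experiment Setup quote above.

# Shared model architecture used for every LRA task.
MODEL_CONFIG = {
    "num_layers": 2,       # 2-layer transformer
    "embedding_dim": 64,   # embedding dimension
    "hidden_dim": 128,     # hidden dimension
    "num_heads": 2,        # attention heads
    "pooling": "mean",     # mean pooling for classification
    "num_features": 128,   # feature count shared by all methods except BigBird
}

# Per-task optimization settings (batch size and learning rate).
TASK_CONFIGS = {
    "text":       {"batch_size": 16,  "learning_rate": 1e-4},  # Text Classification (IMDb)
    "listops":    {"batch_size": 32,  "learning_rate": 1e-4},  # ListOps
    "retrieval":  {"batch_size": 16,  "learning_rate": 2e-4},  # Document Retrieval (AAN)
    "pathfinder": {"batch_size": 128, "learning_rate": 2e-4},  # Pathfinder
    "image":      {"batch_size": 256, "learning_rate": 1e-4},  # Image Classification (CIFAR-10)
}

TRAIN_STEPS = 50_000  # keep the checkpoint with the best dev-set accuracy
```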