Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method

Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on Long Range Arena benchmark show that the proposed method is sufficient in getting comparable or even better performance than the full self-attention while requiring fewer computation resources."
Researcher Affiliation | Academia | "University of Illinois Urbana-Champaign {yifanc10, qizeng2, hengji, yy84}@illinois.edu"
Pseudocode | No | The paper describes mathematical derivations and steps of the method but does not include a formally labeled pseudocode or algorithm block.
Open Source Code | Yes | "Our code is released at https://github.com/pkuzengqi/Skyformer"
Open Datasets | Yes | "We evaluate the proposed methods on five classification tasks on the LRA benchmark [Tay et al., 2020b], which focuses on model quality under long-context scenarios: ListOps [Nangia and Bowman, 2018], Text Classification on the IMDb review dataset [Maas et al., 2011], Document Retrieval on the AAN dataset [Radev et al., 2013], Pathfinder [Linsley et al., 2018], and Image Classification on CIFAR-10 [Krizhevsky et al., 2009]."
Dataset Splits | Yes | "Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation."
Hardware Specification | Yes | "We conduct each experiment on one Tesla V100 SXM2 16GB."
Software Dependencies | No | The paper mentions using "PyTorch" and "Huggingface's implementation", but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "We use a 2-layer transformer model with 64 embedding dimension, 128 hidden dimension, 2 attention heads, and mean pooling for classification. Batch size is selected conditioned on the memory requirements of the standard self-attention method, which leads to 16 for Text Classification, 32 for ListOps, 16 for Document Retrieval, 128 for Pathfinder, and 256 for Image Classification. Learning rate is set to 1e-4 for Text Classification, ListOps, and Image Classification, and 2e-4 for Retrieval and Pathfinder. Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation. For comparable computation complexity, we control the number of features used in all methods (except BigBird) to be 128, under which setting the models will visit 128n elements in the attention matrix."
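For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a small Python configuration sketch. This is illustrative only: the names MODEL_CONFIG, TASK_CONFIGS, and TRAIN_STEPS are hypothetical and simply restate the values above; they are not taken from the released Skyformer code.

```python
# Hypothetical configuration sketch; values mirror the Experiment Setup quote above.

# Shared model architecture used for every LRA task.
MODEL_CONFIG = {
    "num_layers": 2,       # 2-layer transformer
    "embedding_dim": 64,   # embedding dimension
    "hidden_dim": 128,     # hidden dimension
    "num_heads": 2,        # attention heads
    "pooling": "mean",     # mean pooling for classification
    "num_features": 128,   # feature count shared by all methods except BigBird
}

# Per-task optimization settings (batch size and learning rate).
TASK_CONFIGS = {
    "text":       {"batch_size": 16,  "learning_rate": 1e-4},  # Text Classification (IMDb)
    "listops":    {"batch_size": 32,  "learning_rate": 1e-4},  # ListOps
    "retrieval":  {"batch_size": 16,  "learning_rate": 2e-4},  # Document Retrieval (AAN)
    "pathfinder": {"batch_size": 128, "learning_rate": 2e-4},  # Pathfinder
    "image":      {"batch_size": 256, "learning_rate": 1e-4},  # Image Classification (CIFAR-10)
}

TRAIN_STEPS = 50_000  # keep the checkpoint with the best dev-set accuracy
```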