Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Long Range Arena benchmark show that the proposed method is sufficient in getting comparable or even better performance than the full self-attention while requiring fewer computation resources. |
| Researcher Affiliation | Academia | University of Illinois Urbana-Champaign {yifanc10, qizeng2, hengji, yy84}@illinois.edu |
| Pseudocode | No | The paper describes mathematical derivations and steps of the method but does not include a formally labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is released at https://github.com/pkuzengqi/Skyformer |
| Open Datasets | Yes | We evaluate the proposed methods on five classification tasks on the LRA benchmark [Tay et al., 2020b], which focuses on model quality under long-context scenarios: ListOps [Nangia and Bowman, 2018], Text Classification on the IMDb review dataset [Maas et al., 2011], Document Retrieval on the AAN dataset [Radev et al., 2013], Pathfinder [Linsley et al., 2018], and Image Classification on CIFAR-10 [Krizhevsky et al., 2009]. |
| Dataset Splits | Yes | Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation. |
| Hardware Specification | Yes | We conduct each experiment on one Tesla V100 SXM2 16GB. |
| Software Dependencies | No | The paper mentions using "PyTorch" and "Huggingface's implementation", but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use a 2-layer transformer model with 64 embedding dimension, 128 hidden dimension, 2 attention heads, and mean pooling for classification. Batch size is selected conditioned on the memory requirements of the standard self-attention method, which leads to 16 for Text Classification, 32 for ListOps, 16 for Document Retrieval, 128 for Pathfinder, and 256 for Image Classification. Learning rate is set to 1e-4 for Text Classification, ListOps, and Image Classification, and 2e-4 for Retrieval and Pathfinder. Each model on each task is trained for 50k steps, during which the best checkpoint with the highest accuracy on the development set will be saved for evaluation. For comparable computation complexity, we control the number of features used in all methods (except Big Bird) to be 128, under which setting the models will visit 128n elements in the attention matrix. |
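
The Experiment Setup row above can be summarized as a small configuration sketch. This is a minimal, hedged illustration assuming a standard PyTorch encoder: the class name `TinyClassifier`, the task keys, and the use of `nn.TransformerEncoder` are placeholders, and the stock attention layers stand in for the paper's Skyformer attention; the released repository at https://github.com/pkuzengqi/Skyformer is the authoritative implementation.

```python
# Hedged sketch of the reported experiment setup, not the authors' code.
import torch
import torch.nn as nn

# Per-task hyperparameters quoted in the Experiment Setup row.
TASK_CONFIGS = {
    "text":       {"batch_size": 16,  "lr": 1e-4},
    "listops":    {"batch_size": 32,  "lr": 1e-4},
    "retrieval":  {"batch_size": 16,  "lr": 2e-4},
    "pathfinder": {"batch_size": 128, "lr": 2e-4},
    "image":      {"batch_size": 256, "lr": 1e-4},
}
TRAIN_STEPS = 50_000   # best dev-accuracy checkpoint kept for evaluation
NUM_FEATURES = 128     # shared feature budget (except Big Bird); ~128n attention entries


class TinyClassifier(nn.Module):
    """2-layer encoder: 64-d embeddings, 128-d FFN, 2 heads, mean pooling.

    Stock nn.TransformerEncoder layers are used here as a stand-in for
    the Skyformer attention module described in the paper.
    """

    def __init__(self, vocab_size: int, num_classes: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=2, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(64, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(tokens))
        return self.head(hidden.mean(dim=1))  # mean pooling for classification


if __name__ == "__main__":
    cfg = TASK_CONFIGS["listops"]
    model = TinyClassifier(vocab_size=32, num_classes=10)
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    dummy = torch.randint(0, 32, (cfg["batch_size"], 512))  # illustrative sequence length
    print(model(dummy).shape)  # (batch_size, num_classes)
```

In this sketch the only values taken from the paper are the layer count, dimensions, head count, pooling choice, per-task batch sizes, and learning rates; everything else (vocabulary size, sequence length, optimizer choice) is an unstated assumption for the sake of a runnable example.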