Scatterbrain: Unifying Sparse and Low-rank Attention
Authors: Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that Scatterbrain can achieve 2.1× lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT. On a pre-trained T2T Vision transformer, even without fine-tuning, Scatterbrain can reduce 98% of attention memory at the cost of only 1% drop in accuracy. We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks. |
| Researcher Affiliation | Collaboration | Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré. Affiliations: Department of Computer Science, Stanford University; Adobe Research; Department of Computer Science and Engineering, University at Buffalo, SUNY |
| Pseudocode | No | Appendix C describes the precise algorithm, including normalization and causal/unidirectional variants, in prose, but does not present a formally labeled pseudocode block or a code-style algorithm listing (a hedged sketch of the sparse-plus-low-rank decomposition appears after this table). |
| Open Source Code | Yes | Scatterbrain code is available at https://github.com/HazyResearch/scatterbrain |
| Open Datasets | Yes | T2T-ViT [70], which is a token-to-token vision Transformer pre-trained on ImageNet [25]... The datasets are obtained from the Long Range Arena (LRA) Benchmark [58]... On the standard language modeling task of Wikitext-103 [46]... |
| Dataset Splits | Yes | All details (hyperparameters, data splits, etc.), along with additional experiments, are in Appendix E... We follow the same data preprocessing and splits as in [58] and [46]. For the ImageNet dataset, we use the standard splits. We refer readers to the original dataset papers [46, 58, 25] for more details. |
| Hardware Specification | Yes | We use a batch size of 16 for all runs and conduct experiments on a V100 GPU. |
| Software Dependencies | No | The paper mentions: "We adapt the Pytorch implementation from pytorch-fast-transformers library for our baselines and implement Scatterbrain similarly without any customized cuda kernels." However, it does not specify version numbers for PyTorch or the pytorch-fast-transformers library, which are necessary for reproducibility. |
| Experiment Setup | Yes | All details (hyperparameters, data splits, etc.), along with additional experiments, are in Appendix E... The base model for the language modeling experiments on WikiText-103 is a Transformer with 16 layers, 8 heads, and a hidden dimension of 512. We use a sequence length of 2048. We train for 100 epochs using the Adam optimizer with a learning rate of 1e-4 and a batch size of 12. For the Copy task, we use a simple Transformer model with 2 layers, 4 heads, and a hidden dimension of 256. We train for 50 epochs with a learning rate of 1e-4 and a batch size of 32. For LRA, we use a Transformer with 4 layers, 8 heads, and a hidden dimension of 256. We use a learning rate of 1e-3 and a batch size of 64. (An illustrative config summary appears after this table.) |
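
Since the Pseudocode row notes that the algorithm is only described in prose, the following is a minimal, hedged sketch of the sparse-plus-low-rank decomposition the paper's title refers to. It assumes a Performer-style positive random-feature map for the low-rank term and an externally supplied set of sparse (query, key) index pairs standing in for the paper's LSH-selected entries; all function and variable names are illustrative, and this is not the authors' implementation (see Appendix C and the released repo for the actual algorithm, including the causal variant).

```python
# Illustrative sketch only (not the authors' code): approximate the unnormalized
# softmax attention matrix exp(QK^T / sqrt(d)) as a low-rank term phi(Q) phi(K)^T
# plus a sparse correction on a supplied set of (row, col) index pairs.
import math
import torch


def random_feature_map(x, proj):
    """Performer-style positive random features:
    phi(x)_r = exp(w_r^T x - ||x||^2 / 2) / sqrt(m), so E[phi(q)^T phi(k)] = exp(q^T k)."""
    m = proj.shape[0]
    xw = x @ proj.T                              # (n, m)
    sq = (x ** 2).sum(dim=-1, keepdim=True) / 2  # (n, 1)
    return torch.exp(xw - sq) / math.sqrt(m)


def sparse_plus_lowrank_attention(q, k, v, proj, sparse_idx):
    """q, k, v: (n, d); proj: (m, d) Gaussian projections; sparse_idx: (s, 2) long
    tensor of (query, key) pairs whose low-rank estimates are corrected to exact values."""
    d = q.shape[-1]
    q, k = q / d ** 0.25, k / d ** 0.25          # fold the 1/sqrt(d) scaling into q and k
    q_feat, k_feat = random_feature_map(q, proj), random_feature_map(k, proj)

    # Sparse correction values: exact entry minus the low-rank estimate of that entry.
    rows, cols = sparse_idx[:, 0], sparse_idx[:, 1]
    exact = torch.exp((q[rows] * k[cols]).sum(dim=-1))    # (s,)
    approx = (q_feat[rows] * k_feat[cols]).sum(dim=-1)    # (s,)
    corr = exact - approx

    # Low-rank contribution to the output and to the softmax normalizer, in O(n m d).
    out = q_feat @ (k_feat.t() @ v)              # (n, d)
    denom = q_feat @ k_feat.sum(dim=0)           # (n,)

    # Scatter the sparse corrections into both the output and the normalizer.
    out.index_add_(0, rows, corr.unsqueeze(-1) * v[cols])
    denom.index_add_(0, rows, corr)
    return out / denom.unsqueeze(-1)


# Tiny usage example with random sparse indices standing in for LSH-selected pairs.
n, d, m, s = 128, 64, 32, 256
q, k, v = (torch.randn(n, d) for _ in range(3))
proj = torch.randn(m, d)
sparse_idx = torch.randint(0, n, (s, 2))
print(sparse_plus_lowrank_attention(q, k, v, proj, sparse_idx).shape)  # torch.Size([128, 64])
```

The point of the decomposition is that the low-rank matmuls and the scatter of the sparse corrections both avoid materializing the full n-by-n attention matrix, which is where the reported memory savings come from.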
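The Experiment Setup row can be restated compactly as plain Python dictionaries. The field names are illustrative, and settings the response does not mention (Adam betas, weight decay, learning-rate schedule, dropout, LRA epoch count) are deliberately left out rather than guessed.

```python
# Hedged restatement of the hyperparameters quoted in the Experiment Setup row.
wikitext103 = dict(layers=16, heads=8, d_model=512, seq_len=2048,
                   optimizer="Adam", lr=1e-4, batch_size=12, epochs=100)
copy_task   = dict(layers=2,  heads=4, d_model=256,
                   lr=1e-4, batch_size=32, epochs=50)
lra         = dict(layers=4,  heads=8, d_model=256,
                   lr=1e-3, batch_size=64)
```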