Scatterbrain: Unifying Sparse and Low-rank Attention

Authors: Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that Scatterbrain can achieve 2.1× lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT. On a pre-trained T2T Vision Transformer, even without fine-tuning, Scatterbrain can reduce 98% of attention memory at the cost of only a 1% drop in accuracy. We demonstrate Scatterbrain for end-to-end training with up to 4 points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
Researcher Affiliation | Collaboration | Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré; Department of Computer Science, Stanford University; Adobe Research; Department of Computer Science and Engineering, University at Buffalo, SUNY
Pseudocode | No | Appendix C describes the precise algorithm, including normalization and causal/unidirectional variants, in prose, but does not present a formally labeled pseudocode block or algorithm steps formatted as code. (A hedged sketch of the sparse-plus-low-rank combination appears after this table.)
Open Source Code | Yes | Scatterbrain code is available at https://github.com/HazyResearch/scatterbrain
Open Datasets | Yes | T2T-ViT [70], which is a token-to-token vision Transformer pre-trained on ImageNet [25]... The datasets are obtained from the Long Range Arena (LRA) Benchmark [58]... On the standard language modeling task of WikiText-103 [46]...
Dataset Splits | Yes | All details (hyperparameters, data splits, etc.), along with additional experiments, are in Appendix E... We follow the same data preprocessing and splits as in [58] and [46]. For the ImageNet dataset, we use the standard splits. We refer readers to the original dataset papers [46, 58, 25] for more details.
Hardware Specification | Yes | We use a batch size of 16 for all runs and conduct experiments on a V100 GPU.
Software Dependencies | No | The paper mentions: "We adapt the Pytorch implementation from pytorch-fast-transformers library for our baselines and implement Scatterbrain similarly without any customized cuda kernels." However, it does not specify version numbers for PyTorch or the pytorch-fast-transformers library, which are necessary for reproducibility.
Experiment Setup | Yes | All details (hyperparameters, data splits, etc.), along with additional experiments, are in Appendix E... The base model for the language modeling experiments on WikiText-103 is a Transformer with 16 layers, 8 heads, and a hidden dimension of 512. We use a sequence length of 2048. We train for 100 epochs using the Adam optimizer with a learning rate of 1e-4 and a batch size of 12. For the Copy task, we use a simple Transformer model with 2 layers, 4 heads, and a hidden dimension of 256. We train for 50 epochs with a learning rate of 1e-4 and a batch size of 32. For LRA, we use a Transformer with 4 layers, 8 heads, and a hidden dimension of 256. We use a learning rate of 1e-3 and a batch size of 64.
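For quick reference, the hyperparameters quoted in the Experiment Setup row above can be collected into a single configuration object. The sketch below is my own consolidation, not an artifact from the paper or its repository: the field names are illustrative, and settings the excerpt does not state (for example, the optimizer for the Copy and LRA runs) are omitted rather than guessed.

```python
# Consolidated view of the training settings quoted above (assumed field names;
# unstated values are deliberately left out rather than invented).
EXPERIMENT_CONFIGS = {
    "wikitext103_language_modeling": {
        "layers": 16, "heads": 8, "hidden_dim": 512, "seq_len": 2048,
        "epochs": 100, "optimizer": "Adam", "learning_rate": 1e-4, "batch_size": 12,
    },
    "copy_task": {
        "layers": 2, "heads": 4, "hidden_dim": 256,
        "epochs": 50, "learning_rate": 1e-4, "batch_size": 32,
    },
    "long_range_arena": {
        "layers": 4, "heads": 8, "hidden_dim": 256,
        "learning_rate": 1e-3, "batch_size": 64,
    },
}
```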
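Although the paper offers no formally labeled pseudocode, the core idea, approximating softmax attention with a low-rank estimate plus an exact correction on a sparse set of entries, is straightforward to sketch. The snippet below is a minimal, dense-for-clarity illustration and not the authors' algorithm: it assumes Performer-style positive random features for the low-rank part, takes the sparse support as a precomputed boolean mask (standing in for an LSH-based selection), and handles only the bidirectional, non-causal case.

```python
import torch

def random_feature_map(x, proj):
    # Performer-style positive random features: with proj ~ N(0, I),
    # phi(q) . phi(k) is an unbiased estimate of exp(q . k / sqrt(d))
    # after the 1/d**0.25 rescaling below.
    d = x.shape[-1]
    x = x / d ** 0.25                               # fold softmax's 1/sqrt(d) into q and k
    xp = x @ proj                                   # (n, m) random projections
    sq = (x ** 2).sum(dim=-1, keepdim=True) / 2     # ||x||^2 / 2 per row
    return torch.exp(xp - sq) / proj.shape[1] ** 0.5

def sparse_plus_lowrank_attention(q, k, v, support, n_features=64):
    """q, k, v: (n, d) tensors; support: (n, n) boolean mask of entries to correct exactly."""
    d = q.shape[-1]
    proj = torch.randn(d, n_features)
    phi_q = random_feature_map(q, proj)
    phi_k = random_feature_map(k, proj)

    lowrank = phi_q @ phi_k.t()                     # low-rank estimate of exp(q k^T / sqrt(d))
    exact = torch.exp(q @ k.t() / d ** 0.5)         # dense here purely for illustration
    approx = lowrank + support * (exact - lowrank)  # keep exact values on the sparse support

    return (approx @ v) / approx.sum(dim=-1, keepdim=True)  # softmax-style normalization

# Tiny usage example with a random, roughly 20%-dense support pattern.
q, k, v = (torch.randn(8, 16) for _ in range(3))
support = torch.rand(8, 8) < 0.2
out = sparse_plus_lowrank_attention(q, k, v, support)
```

A practical implementation would never materialize the dense `exact` matrix; it would evaluate the correction only on the supported entries (e.g. with block-sparse kernels), which is where the attention-memory savings quoted above come from.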