Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention

Authors: Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh

AAAI 2021, pp. 14138-14148

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases, even slightly better, than standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods."
Researcher Affiliation | Collaboration | "1 University of Wisconsin-Madison, 2 UC Berkeley, 3 Google Brain, 4 American Family Insurance"
Pseudocode | Yes | "Algorithm 1: Pipeline for Nyström approximation of softmax matrix in self-attention" (a PyTorch sketch of this approximation appears after the table)
Open Source Code | Yes | "Our code is available at https://github.com/mlpen/Nystromformer."
Open Datasets | Yes | "We consider BookCorpus plus English Wikipedia as the training corpus, which is further split into training (80%) and validation (20%) sets. Our model is trained using the training set. We report the masked-language-modeling (MLM) and sentence-order-prediction (SOP) accuracy on the validation set, and compare the efficiency (runtime/memory) to a baseline. Baselines. Our baseline is the well-known Transformer-based model BERT (Devlin et al. 2019). Specifically, we consider two variants of BERT: BERT-small is a lightweight BERT model with 4 layers. We use BERT-small to compare to linear Transformers, including ELU linearized self-attention (Katharopoulos et al. 2020) and Linformer (Wang et al. 2020). BERT-base is the base model from (Devlin et al. 2019). We use this model as our baseline when fine-tuning on downstream NLP tasks. Our Nyströmformer replaces the self-attention in BERT-small and BERT-base using the proposed Nyström approximation. We acknowledge that several very recent articles (Zaheer et al. 2020; Beltagy, Peters, and Cohan 2020), concurrent with our work, have also proposed efficient O(n) self-attention for Transformers. An exhaustive comparison to a rapidly growing set of algorithms is prohibitive unless extensive compute resources are freely available. Thus, we only compare runtime performance and the memory consumption of our method to Linformer (Wang et al. 2020) and Longformer (Beltagy, Peters, and Cohan 2020) in Table 1."
Dataset Splits | Yes | "We consider BookCorpus plus English Wikipedia as the training corpus, which is further split into training (80%) and validation (20%) sets."
Hardware Specification | Yes | "For example, training a BERT-large model (Devlin et al. 2019) will need 4 months using a single Tesla V100 GPU (equivalent to 4 days using a 4x4 TPU pod). All models are evaluated on the same machine setting with an Nvidia 1080Ti, and we report the improved inference speed and memory savings."
Software Dependencies | No | "The official LRA benchmark (Tay et al. 2020) is implemented in Jax/Flax (Frostig, Johnson, and Leary 2018). To achieve a fair comparison to our baselines implemented in PyTorch, we reimplemented the benchmark in PyTorch and verified the results. All our experiments, including our method and all baselines, use a Transformer model with 2 layers, 64 embedding dimension, 128 hidden dimension, and 2 attention heads. Mean pooling is used for all tasks. The number of hashes for Reformer is 2, the projection dimension for Linformer is 256, and the random feature dimension for Performer is 256." The paper mentions PyTorch and Jax/Flax but does not provide specific version numbers for these software components. (A configuration sketch matching these hyperparameters appears after the table.)
Experiment Setup | Yes | "Our model is pre-trained with the masked-language-modeling (MLM) and sentence-order-prediction (SOP) objectives (Lan et al. 2020). We use a batch size of 256, Adam optimizer with learning rate 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear learning rate decay to update our model. Training BERT-base with 1M update steps takes more than one week on 8 V100 GPUs. To keep compute costs reasonable, our baseline (BERT-base) and our model are trained with 0.5M steps. We also train our model with 0.25M steps, initialized from pre-trained BERT-base for speed-up. For BERT-small, we train for 0.1M steps. More details are in the appendix. For larger datasets (SST-2, QNLI, QQP, MNLI, IMDB reviews), we use a batch size of 32 and the AdamW optimizer with learning rate 3e-5 and fine-tune our models for 4 epochs." (An optimizer/scheduler sketch for this schedule appears after the table.)
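
The "Algorithm 1" referenced in the Pseudocode row approximates the n x n softmax matrix with three smaller kernels built from landmark queries and keys. Below is a minimal PyTorch sketch of that idea, assuming segment-mean landmarks and a sequence length divisible by the landmark count; it substitutes torch.linalg.pinv for the paper's iterative pseudoinverse, and the function and variable names are illustrative rather than the repository's API.

```python
import torch

def nystrom_softmax_attention(Q, K, V, num_landmarks=64):
    """Sketch of the Nystrom approximation of softmax self-attention.

    Q, K, V: (batch, seq_len, head_dim); seq_len is assumed to be
    divisible by num_landmarks so landmarks can be segment-means.
    """
    b, n, d = Q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmarks: means over contiguous segments of the sequence.
    Q_tilde = Q.reshape(b, m, n // m, d).mean(dim=2)   # (b, m, d)
    K_tilde = K.reshape(b, m, n // m, d).mean(dim=2)   # (b, m, d)

    # Three small softmax kernels replacing the full n x n matrix.
    F_mat = torch.softmax(Q @ K_tilde.transpose(-1, -2) * scale, dim=-1)        # (b, n, m)
    A_mat = torch.softmax(Q_tilde @ K_tilde.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    B_mat = torch.softmax(Q_tilde @ K.transpose(-1, -2) * scale, dim=-1)        # (b, m, n)

    # The paper computes the Moore-Penrose pseudoinverse of A_mat with an
    # iterative scheme; torch.linalg.pinv is used here purely for brevity.
    A_pinv = torch.linalg.pinv(A_mat)

    # F (A^+ (B V)) -- the n x n attention matrix is never materialized.
    return F_mat @ (A_pinv @ (B_mat @ V))
```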
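
For the LRA setup quoted in the Software Dependencies row (2 layers, 64 embedding dimension, 128 hidden dimension, 2 attention heads, mean pooling), a hypothetical PyTorch stand-in for the vanilla-attention baseline could look like the sketch below; the class name, vocabulary handling, and maximum sequence length are assumptions, and the benchmark itself swaps the attention module for each efficient-attention method.

```python
import torch
import torch.nn as nn

class LRAClassifier(nn.Module):
    """Hypothetical baseline classifier with the quoted LRA hyperparameters."""

    def __init__(self, vocab_size, num_classes, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)     # 64-dim token embeddings
        self.pos = nn.Embedding(max_len, 64)          # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=2, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(64, num_classes)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))               # mean pooling over tokens
```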
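
The pre-training schedule quoted in the Experiment Setup row (Adam, learning rate 1e-4, β1 = 0.9, β2 = 0.999, weight decay 0.01, 10,000 warm-up steps, linear decay) could be wired up roughly as follows; build_pretraining_optimizer and the LambdaLR formulation are illustrative assumptions, not the authors' training script.

```python
import torch

def build_pretraining_optimizer(model, total_steps, warmup_steps=10_000):
    # Adam with the quoted hyperparameters: lr 1e-4, betas (0.9, 0.999),
    # L2 weight decay 0.01.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

    # Linear warm-up over the first 10,000 steps, then linear decay to zero
    # over the remaining steps (total_steps is 0.5M in the paper's main run).
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

The fine-tuning runs quoted above (batch size 32, AdamW, learning rate 3e-5, 4 epochs) would use a separate, analogous setup that is not shown here.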