You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

Authors: Zhanpeng Zeng, Yunyang Xiong, Sathya Ravi, Shailesh Acharya, Glenn M Fung, Vikas Singh

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods.
Researcher Affiliation | Collaboration | (1) University of Wisconsin, Madison, USA; (2) University of Illinois, Chicago, USA; (3) American Family Insurance, Madison, USA.
Pseudocode | Yes | Figure 3. Overview of YOSO-Attention algorithm. The hash table stores the sum of values associated with hashed keys. (A code sketch of this hash-table accumulation step follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/mlpen/YOSO.
Open Datasets | Yes | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length...On the Long Range Arena (LRA) benchmark...model is pretrained on Book Corpus (Zhu et al., 2015) and English Wikipedia...MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks...LRA benchmark (Tay et al., 2021) and consists of five tasks: ListOps (Nangia & Bowman, 2018), byte-level IMDb review classification (Text) (Maas et al., 2011), byte-level document matching (Retrieval) (Radev et al., 2013), pixel-level CIFAR-10 classification (Image) (Krizhevsky et al., 2009), and pixel-level Pathfinder (Linsley et al., 2018).
Dataset Splits | Yes | We plot MLM validation perplexity and SOP validation loss curves of 512-length models pretrained with softmax self-attention and YOSO-Attention (Fig. 4, right) and show the MLM validation perplexity and SOP accuracy obtained in Table 2. We finetuned all pretrained BERT-base models on MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks in the GLUE benchmark and report their corresponding dev metrics. (See the split-loading sketch after the table.)
Hardware Specification | Yes | The experiments were performed on a single NVIDIA 2080TI.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | We use the same hyperparameters for pretraining as Devlin et al. (2019). All models are trained for 500K steps (batch size of 256)... For the large datasets, including QNLI, QQP, and MNLI, we did not perform a hyperparameter search due to the extensive resources required; we used a batch size of 32 and a learning rate of 3e-5 and finetuned our models for 4 epochs. We use a Transformer model of 6 layers, 256 embedding dimension, 1024 hidden dimension, 4 attention heads... (These values are collected in the configuration sketch after the table.)
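
The Pseudocode row above summarizes the core mechanism of Figure 3: keys are hashed with LSH, their value vectors are summed into a hash table, and each query reads the bucket it hashes to, averaged over independent hash rounds. Below is a minimal NumPy sketch of that accumulation step. The function and parameter names (`yoso_attention_sketch`, `n_hashes`, `n_bits`) are illustrative assumptions, not the paper's reference implementation; the actual collision-probability weighting, normalization, and CUDA kernels live in the linked repository.

```python
# Minimal sketch of the hash-table accumulation idea in Figure 3.
# Names and defaults are illustrative, not the paper's implementation.
import numpy as np

def lsh_hash(x, planes):
    # Sign-random-projection LSH: one bit per hyperplane, packed into an integer id.
    bits = (x @ planes) > 0                                            # (n, n_bits)
    return bits.astype(np.int64) @ (1 << np.arange(planes.shape[1]))   # bucket ids

def yoso_attention_sketch(Q, K, V, n_hashes=8, n_bits=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    # Unit-normalize so a collision depends only on the query-key angle.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    out = np.zeros_like(V, dtype=np.float64)
    for _ in range(n_hashes):
        planes = rng.standard_normal((d, n_bits))
        k_buckets = lsh_hash(Kn, planes)
        q_buckets = lsh_hash(Qn, planes)
        # Hash table: the sum of value vectors of all keys hashed to each bucket.
        table = np.zeros((1 << n_bits, V.shape[-1]))
        np.add.at(table, k_buckets, V)
        # Each query reads the accumulated values of the bucket it hashes to.
        out += table[q_buckets]
    return out / n_hashes  # Monte Carlo average over hash rounds
```

Each hash round touches every token once, so the cost grows linearly with sequence length rather than quadratically; the Bernoulli-sampling weighting and normalization of the full method are omitted from this toy sketch.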
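
The Dataset Splits row reports dev-set metrics for GLUE finetuning. The paper does not name its data-loading tooling; assuming the Hugging Face `datasets` library purely for illustration, the train/dev splits that carry those metrics can be pulled as follows.

```python
# Hedged sketch: the paper does not specify its data pipeline; Hugging Face
# `datasets` is assumed here only to show which splits carry the reported metrics.
from datasets import load_dataset

for task in ["mrpc", "sst2", "qnli", "qqp", "mnli"]:
    data = load_dataset("glue", task)
    train = data["train"]                       # used for finetuning
    # MNLI exposes matched/mismatched dev sets; the other tasks have a single one.
    dev = data["validation_matched"] if task == "mnli" else data["validation"]
    print(f"{task}: {len(train)} train / {len(dev)} dev examples")
```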
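
For reference, the hyperparameters quoted in the Experiment Setup row are gathered into one configuration sketch below. The dictionary keys and grouping are editorial assumptions; only the numeric values come from the paper's text.

```python
# Editorial grouping of the quoted hyperparameters; key names are illustrative.
EXPERIMENT_SETUP = {
    "bert_pretraining": {               # same hyperparameters as Devlin et al. (2019)
        "train_steps": 500_000,
        "batch_size": 256,
    },
    "glue_finetuning_large_tasks": {    # QNLI, QQP, MNLI: no hyperparameter search
        "batch_size": 32,
        "learning_rate": 3e-5,
        "epochs": 4,
    },
    "lra_transformer": {                # Long Range Arena model
        "num_layers": 6,
        "embedding_dim": 256,
        "hidden_dim": 1024,
        "num_attention_heads": 4,
    },
}
```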