You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Authors: Zhanpeng Zeng, Yunyang Xiong, Sathya Ravi, Shailesh Acharya, Glenn M Fung, Vikas Singh
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin, Madison, USA 2University of Illinois, Chicago, USA 3American Family Insurance, Madison, USA. |
| Pseudocode | Yes | Figure 3. Overview of YOSO-Attention algorithm. The hash table stores the sum of values associated with hashed keys. (See the illustrative sketch after this table.) |
| Open Source Code | Yes | Our code is available at https://github.com/mlpen/YOSO. |
| Open Datasets | Yes | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length...On the Long Range Arena (LRA) benchmark...model is pretrained on Book Corpus (Zhu et al., 2015) and English Wikipedia...MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks...LRA benchmark (Tay et al., 2021) and consists of five tasks: Listops (Nangia & Bowman, 2018), byte-level IMDb reviews classification (Text) (Maas et al., 2011), byte-level document matching (Retrieval) (Radev et al., 2013), pixel-level CIFAR-10 classification (Image) (Krizhevsky et al., 2009), and pixel-level Pathfinder (Linsley et al., 2018). |
| Dataset Splits | Yes | We plot MLM validation perplexity and SOP validation loss curves of 512 length models pretrained with softmax self-attention and YOSO-Attention (Fig. 4 right) and show the MLM validation perplexity and SOP accuracy obtained in Table 2. We finetuned all pretrained BERT-base models on MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks in the GLUE benchmark and report their corresponding dev metrics. |
| Hardware Specification | Yes | The experiments were performed on a single NVIDIA 2080TI. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | We use the same hyperparameters for pretraining as Devlin et al. (2019). All models are trained for 500K steps (batch size of 256)... for large datasets including QNLI, QQP, and MNLI, due to extensive resource needs, we did not perform hyperparameter search, so we used a batch size of 32 and learning rate 3e-5 to update our model and finetune our models for 4 epochs. We use a Transformer model of 6 layers, 256 embedding dimension, 1024 hidden dimension, 4 attention heads... (A compact restatement of these settings appears as a config sketch below.) |
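
The pseudocode row quotes Figure 3 of the paper, in which a hash table accumulates the sum of values associated with hashed keys. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation (their repository ships custom CUDA kernels); the function and parameter names (`simhash`, `yoso_like_attention`, `num_hashes`, `num_bits`) are illustrative assumptions.

```python
# Illustrative sketch of LSH-bucketed attention in the spirit of Figure 3:
# hash keys with sign random projections, accumulate the sum of values per
# bucket, and let each query read the bucket it hashes to. Averaging over
# several independent hash tables stands in for repeated Bernoulli sampling.
import numpy as np

def simhash(x, projections):
    """Map each row of x to an integer bucket id via sign random projections."""
    bits = (x @ projections) > 0                    # (n, num_bits) sign pattern
    powers = 1 << np.arange(projections.shape[1])   # (num_bits,) bit weights
    return bits @ powers                            # (n,) bucket ids

def yoso_like_attention(Q, K, V, num_hashes=8, num_bits=6, seed=0):
    """Linear-cost approximation: one pass over keys, one pass over queries."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    out = np.zeros_like(V, dtype=float)
    # Unit-normalize so the collision probability depends only on the angle
    # between a query and a key (the quantity the Bernoulli sampling targets).
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    for _ in range(num_hashes):
        proj = rng.standard_normal((d, num_bits))
        k_buckets = simhash(Kn, proj)
        q_buckets = simhash(Qn, proj)
        # Hash table: sum of values associated with each hashed key.
        table = np.zeros((1 << num_bits, V.shape[1]))
        np.add.at(table, k_buckets, V)
        # Each query reads the value-sum of the bucket it collides with.
        out += table[q_buckets]
    return out / num_hashes

# Tiny usage example.
Q, K, V = (np.random.randn(16, 32) for _ in range(3))
print(yoso_like_attention(Q, K, V).shape)   # (16, 32)
```

The real method additionally normalizes the output and uses the LSH collision probability as the attention kernel; this sketch only shows why the cost is linear in the sequence length.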
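
For quick reference, the experiment-setup quote above can be collected into a single configuration. The dictionary below only restates the reported numbers; its field names are not taken from the YOSO repository.

```python
# Reported settings, restated as plain Python dicts (field names are illustrative).
pretrain_config = {
    "steps": 500_000,        # "All models are trained for 500K steps"
    "batch_size": 256,       # same hyperparameters as Devlin et al. (2019)
}
glue_finetune_config = {     # QNLI, QQP, MNLI: no hyperparameter search reported
    "batch_size": 32,
    "learning_rate": 3e-5,
    "epochs": 4,
}
lra_model_config = {
    "num_layers": 6,
    "embedding_dim": 256,
    "hidden_dim": 1024,
    "num_heads": 4,
}
```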