You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

Authors: Zhanpeng Zeng, Yunyang Xiong, Sathya Ravi, Shailesh Acharya, Glenn M Fung, Vikas Singh

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods.
Researcher Affiliation | Collaboration | (1) University of Wisconsin, Madison, USA; (2) University of Illinois, Chicago, USA; (3) American Family Insurance, Madison, USA.
Pseudocode | Yes | Figure 3. Overview of YOSO-Attention algorithm. The hash table stores the sum of values associated with hashed keys. (A code sketch of this hash-table accumulation step follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/mlpen/YOSO.
Open Datasets | Yes | We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length...On the Long Range Arena (LRA) benchmark...model is pretrained on Book Corpus (Zhu et al., 2015) and English Wikipedia...MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks...LRA benchmark (Tay et al., 2021) and consists of five tasks: ListOps (Nangia & Bowman, 2018), byte-level IMDb review classification (Text) (Maas et al., 2011), byte-level document matching (Retrieval) (Radev et al., 2013), pixel-level CIFAR-10 classification (Image) (Krizhevsky et al., 2009), and pixel-level Pathfinder (Linsley et al., 2018).
Dataset Splits | Yes | We plot MLM validation perplexity and SOP validation loss curves of 512-length models pretrained with softmax self-attention and YOSO-Attention (Fig. 4, right) and show the MLM validation perplexity and SOP accuracy obtained in Table 2. We finetuned all pretrained BERT-base models on MRPC (Dolan & Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks in the GLUE benchmark and report their corresponding dev metrics. (See the split-loading sketch after the table.)
Hardware Specification | Yes | The experiments were performed on a single NVIDIA 2080TI.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the experiments.
Experiment Setup | Yes | We use the same hyperparameters for pretraining as Devlin et al. (2019). All models are trained for 500K steps (batch size of 256)... For the large datasets, including QNLI, QQP, and MNLI, we did not perform a hyperparameter search due to the extensive resources required; we used a batch size of 32 and a learning rate of 3e-5 and finetuned our models for 4 epochs. We use a Transformer model of 6 layers, 256 embedding dimension, 1024 hidden dimension, 4 attention heads... (These values are collected in the configuration sketch after the table.)
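
The Pseudocode row above summarizes the core mechanism of Figure 3: keys are hashed with LSH, their value vectors are summed into a hash table, and each query reads the bucket it hashes to, averaged over independent hash rounds. Below is a minimal NumPy sketch of that accumulation step. The function and parameter names (`yoso_attention_sketch`, `n_hashes`, `n_bits`) are illustrative assumptions, not the paper's reference implementation; the actual collision-probability weighting, normalization, and CUDA kernels live in the linked repository.

```python
# Minimal sketch of the hash-table accumulation idea in Figure 3.
# Names and defaults are illustrative, not the paper's implementation.
import numpy as np

def lsh_hash(x, planes):
    # Sign-random-projection LSH: one bit per hyperplane, packed into an integer id.
    bits = (x @ planes) > 0                                            # (n, n_bits)
    return bits.astype(np.int64) @ (1 << np.arange(planes.shape[1]))   # bucket ids

def yoso_attention_sketch(Q, K, V, n_hashes=8, n_bits=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    # Unit-normalize so a collision depends only on the query-key angle.
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    out = np.zeros_like(V, dtype=np.float64)
    for _ in range(n_hashes):
        planes = rng.standard_normal((d, n_bits))
        k_buckets = lsh_hash(Kn, planes)
        q_buckets = lsh_hash(Qn, planes)
        # Hash table: the sum of value vectors of all keys hashed to each bucket.
        table = np.zeros((1 << n_bits, V.shape[-1]))
        np.add.at(table, k_buckets, V)
        # Each query reads the accumulated values of the bucket it hashes to.
        out += table[q_buckets]
    return out / n_hashes  # Monte Carlo average over hash rounds
```

Each hash round touches every token once, so the cost grows linearly with sequence length rather than quadratically; the Bernoulli-sampling weighting and normalization of the full method are omitted from this toy sketch.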
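
The Dataset Splits row reports dev-set metrics for GLUE finetuning. The paper does not name its data-loading tooling; assuming the Hugging Face `datasets` library purely for illustration, the train/dev splits that carry those metrics can be pulled as follows.

```python
# Hedged sketch: the paper does not specify its data pipeline; Hugging Face
# `datasets` is assumed here only to show which splits carry the reported metrics.
from datasets import load_dataset

for task in ["mrpc", "sst2", "qnli", "qqp", "mnli"]:
    data = load_dataset("glue", task)
    train = data["train"]                       # used for finetuning
    # MNLI exposes matched/mismatched dev sets; the other tasks have a single one.
    dev = data["validation_matched"] if task == "mnli" else data["validation"]
    print(f"{task}: {len(train)} train / {len(dev)} dev examples")
```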
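
For reference, the hyperparameters quoted in the Experiment Setup row are gathered into one configuration sketch below. The dictionary keys and grouping are editorial assumptions; only the numeric values come from the paper's text.

```python
# Editorial grouping of the quoted hyperparameters; key names are illustrative.
EXPERIMENT_SETUP = {
    "bert_pretraining": {               # same hyperparameters as Devlin et al. (2019)
        "train_steps": 500_000,
        "batch_size": 256,
    },
    "glue_finetuning_large_tasks": {    # QNLI, QQP, MNLI: no hyperparameter search
        "batch_size": 32,
        "learning_rate": 3e-5,
        "epochs": 4,
    },
    "lra_transformer": {                # Long Range Arena model
        "num_layers": 6,
        "embedding_dim": 256,
        "hidden_dim": 1024,
        "num_attention_heads": 4,
    },
}
```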