Efficient Attention via Control Variates

Authors: Lin Zheng, Jianbo Yuan, Chong Wang, Lingpeng Kong

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.
Researcher Affiliation | Collaboration | Lin Zheng¹, Jianbo Yuan², Chong Wang³, Lingpeng Kong¹; ¹The University of Hong Kong, ²ByteDance Inc., ³Apple Inc.
Pseudocode | Yes | The pseudo-code of EVA is provided in Algorithm 1 of the Appendix. More implementation details, including the definition of k̃_c and β̂_c(ω) in Equation 16, are deferred to Appendix C. (A generic control-variate illustration, separate from the paper's Algorithm 1, is sketched after the table.)
Open Source Code | Yes | Our code and models are available at this link.
Open Datasets | Yes | We evaluate our proposed method on various tasks, including image classification (§5.1), language tasks (§5.2), and Long Range Arena benchmark (Appendix F). Details of experimental protocols and baselines can be found in Appendix E. In particular, we evaluate their performance on the ImageNet1k dataset (Deng et al., 2009)... Masked language modeling (MLM) on a pretraining-scale book corpus Books3 in the Pile dataset suite (Presser, 2020; Gao et al., 2020)... Machine translation (MT) on WMT14 En-De benchmark (Bojar et al., 2014). Autoregressive language modeling (Autoregressive LM) on a large-scale token-level LM benchmark Wikitext-103 (Merity et al., 2016).
Dataset Splits | Yes | ImageNet1k... contains over 1,280K and 50K images of 1,000 classes for training and validation splits, respectively. For the used corpus Books3, we randomly select 100 books without replacement for the validation split... WMT14 En-De dataset, resulting in around 4.5M/3K/3K English-German sentence pairs for training/validation/testing splits, respectively... Wikitext-103 benchmark, which consists of around 103M/218K/246K tokens for training/validation/testing splits, respectively.
Hardware Specification | Yes | All of our experiments are conducted with at most 16 NVIDIA V100 GPUs. We benchmark different attention modules on one NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions 'Our implementation for all language tasks is based on the Fairseq toolkit (Ott et al., 2019).' and optimizers like 'AdamW (Loshchilov & Hutter, 2019)'. However, no specific version numbers are provided for these software dependencies or libraries.
Experiment Setup | Yes | Closely following DeiT (Touvron et al., 2021), we employ the AdamW (Loshchilov & Hutter, 2019) optimizer to train models for 300 epochs, where the number of warm-up epochs is 10, the learning rate is 0.001 with cosine learning rate decay (Loshchilov & Hutter, 2016), and batch size is set to 1024. Table 9: Our hyper-parameter configuration for different attention mechanisms on DeiT-Tiny-784. Table 11: Our hyper-parameter configuration for Masked Language Modeling (MLM). (A minimal optimizer/schedule sketch of this recipe follows the table.)
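
The Pseudocode row points to EVA's Algorithm 1 in the paper's Appendix, which is not reproduced here. As a minimal, hedged illustration of the general control-variate idea the method is named after (not the paper's algorithm), the NumPy sketch below estimates an expectation with and without a control variate whose mean is known, showing the variance reduction. The functions f and g and all constants are invented for the example.

```python
import numpy as np

# Generic control-variate estimator: estimate E[f(w)] for w ~ N(0, 1)
# using a correlated g(w) with known expectation as a control variate.
# This only illustrates the identity EVA builds on, not EVA itself.

rng = np.random.default_rng(0)

def f(w):
    return np.exp(0.5 * w)        # quantity whose mean we want (example choice)

def g(w):
    return 1.0 + 0.5 * w          # control variate, with known mean E[g] = 1.0

E_g = 1.0
w = rng.standard_normal(100_000)

plain = f(w)                                        # vanilla Monte Carlo samples
beta = np.cov(f(w), g(w))[0, 1] / np.var(g(w))      # optimal scaling coefficient
cv = f(w) - beta * (g(w) - E_g)                     # control-variate corrected samples

print("plain MC: mean=%.4f  var=%.4f" % (plain.mean(), plain.var()))
print("with CV:  mean=%.4f  var=%.4f" % (cv.mean(), cv.var()))
# Both estimators are unbiased for E[f(w)]; the corrected one has lower variance
# because g is positively correlated with f.
```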
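
The Experiment Setup row quotes the ImageNet1k recipe (AdamW, 300 epochs, 10 warm-up epochs, peak learning rate 0.001 with cosine decay, batch size 1024). Below is a minimal PyTorch sketch of that optimizer-and-schedule configuration only. The model is a placeholder, and the weight-decay value of 0.05 is an assumption (common in DeiT-style training) that is not stated in the excerpt; this is not the authors' training script.

```python
import math
import torch

# Optimizer and learning-rate schedule matching the quoted recipe:
# AdamW, 300 epochs, 10 warm-up epochs, peak lr 1e-3, cosine decay, batch 1024.
EPOCHS, WARMUP_EPOCHS, PEAK_LR, BATCH_SIZE = 300, 10, 1e-3, 1024

model = torch.nn.Linear(768, 1000)  # stand-in for the DeiT-style backbone (placeholder)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, weight_decay=0.05  # weight decay assumed, not quoted
)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up for the first 10 epochs, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    # ... one pass over ImageNet1k with global batch size 1024 would go here ...
    optimizer.step()   # placeholder step so the sketch runs end to end
    scheduler.step()
```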