Linear Complexity Randomized Self-attention Mechanism
Authors: Lin Zheng, Chong Wang, Lingpeng Kong
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various domains demonstrate that RA and LARA significantly improve the performance of RFAs by a substantial margin. ... We conduct extensive experiments across various domains to verify the effectiveness of linear randomized attention. Firstly, we start with an experiment to assess the approximation error of different random feature based methods (§5.1). We then perform a number of experiments on various data modalities, including image classification (§5.2), video action recognition (§5.3), machine translation (§5.4), and long sequence modeling on the Long Range Arena benchmark (Appendix I.2). Additional details as well as ablation studies can be found in Appendices H and I. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, The University of Hong Kong; 2 ByteDance Inc.; 3 Shanghai Artificial Intelligence Laboratory. |
| Pseudocode | Yes | Algorithm 1 Randomized Attention (RA) ... Algorithm 2 Random Feature Attention (RFA) ... Algorithm 3 Linear Randomized Attention (LARA) (a sketch of the random-feature attention these algorithms build on follows the table) |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about releasing the source code for the proposed methodology. |
| Open Datasets | Yes | For image classification, we conduct our experiment on the ImageNet-1k benchmark (Deng et al., 2009), which consists of approximately 1,280K/50K images over 1,000 classes for training/validation splits respectively. ... We consider two standard datasets: (1) Kinetics-400 (K400; Kay et al., 2017), which contains 238,574 videos for training and 19,877 for evaluation at the time of writing and (2) Something-something-v2 (SSv2; Goyal et al., 2017), consisting of around 168K/25K videos of 174 classes for training/validation splits respectively. ... We conduct experiments on the WMT14 EN-DE machine translation benchmark (Bojar et al., 2014) to evaluate the performance of our model under various sequence lengths. We follow Vaswani et al. (2017) and Ott et al. (2018) to preprocess this dataset, resulting in about 4.5M/3K/3K sentence pairs for training/validation/testing splits respectively. |
| Dataset Splits | Yes | ImageNet-1k benchmark (Deng et al., 2009), which consists of approximately 1,280K/50K images over 1,000 classes for training/validation splits respectively. ... Something-something-v2 (SSv2; Goyal et al., 2017), consisting of around 168K/25K videos of 174 classes for training/validation splits respectively. ... WMT14 EN-DE machine translation benchmark (Bojar et al., 2014) ... resulting in about 4.5M/3K/3K sentence pairs for training/validation/testing splits respectively. |
| Hardware Specification | No | The paper mentions running simulations and discusses computational complexity, but it does not specify any particular hardware (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and the PySlowFast codebase, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In particular, we use the AdamW optimizer (Loshchilov & Hutter, 2019) for 300 epochs, where we set the batch size to 1024 and the learning rate to 0.001 with cosine learning rate decay (Loshchilov & Hutter, 2016). The number of warm-up epochs is set to 10 for all models instead of 5, since we find it often stabilizes training and leads to better results. For data augmentation, we follow Touvron et al. (2021) and use random clipping, cropping, rand-augment (Cubuk et al., 2020) and random erasing (Zhong et al., 2020). ... For regularization, we employ stochastic depth (Huang et al., 2016), Mixup (Zhang et al., 2017), CutMix (Yun et al., 2019), label smoothing and weight decay, all of which are set to default settings in DeiT (Touvron et al., 2021). (A sketch of this optimizer and learning-rate schedule follows the table.) |
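For context on the pseudocode row, the following is a minimal sketch, assuming standard positive random features, of how RFA-style attention (the baseline that RA and LARA improve on) reaches linear complexity. It is not the authors' implementation; the feature map, scaling, and `num_features` value are conventional choices rather than code taken from the paper.

```python
# Minimal NumPy sketch of random feature attention (RFA): softmax attention
# approximated with positive random features so the sequence dimension is
# summed over only once, giving O(n) cost. Illustrative only.
import numpy as np

def random_feature_map(x, w):
    # phi(x) = exp(w @ x - ||x||^2 / 2), chosen so that E_w[phi(q) * phi(k)] = exp(q . k)
    return np.exp(x @ w.T - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True))

def rfa_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention; q, k, v have shape (seq_len, head_dim)."""
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    # Scale q and k by d**-0.25 each so q . k matches the usual QK^T / sqrt(d) scores.
    q, k = q / d ** 0.25, k / d ** 0.25
    w = rng.standard_normal((num_features, d))
    phi_q = random_feature_map(q, w)   # (seq_len, num_features)
    phi_k = random_feature_map(k, w)   # (seq_len, num_features)
    kv = phi_k.T @ v                   # (num_features, head_dim): one pass over the sequence
    z = phi_k.sum(axis=0)              # (num_features,): normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Quick comparison against exact softmax attention on a short sequence.
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
scores = q @ k.T / np.sqrt(q.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
exact = (weights / weights.sum(axis=-1, keepdims=True)) @ v
approx = rfa_attention(q, k, v)
print(np.abs(exact - approx).max())   # error shrinks as num_features grows
```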
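The Experiment Setup row maps onto a conventional warm-up-plus-cosine schedule. Below is a hedged PyTorch sketch of that recipe; the backbone stand-in, weight-decay value, and `steps_per_epoch` are illustrative placeholders, not figures taken from the paper.

```python
# Sketch of the quoted optimization settings: AdamW, lr 1e-3, batch size 1024,
# 300 epochs with cosine learning-rate decay and 10 warm-up epochs.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

epochs, warmup_epochs, steps_per_epoch = 300, 10, 1250   # 1250 ~ 1.28M images / batch 1024
model = torch.nn.Linear(384, 1000)                       # placeholder for the vision backbone

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # weight decay assumed

def lr_lambda(step):
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)                     # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay to zero

# Multiplies the base lr of 1e-3; call scheduler.step() once per optimizer step.
scheduler = LambdaLR(optimizer, lr_lambda)
```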