Random Feature Attention
Authors: Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, Lingpeng Kong
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. |
| Researcher Affiliation | Collaboration | Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong. Paul G. Allen School of Computer Science & Engineering, University of Washington; DeepMind; Allen Institute for Artificial Intelligence; School of Computer Science & Engineering, Hebrew University of Jerusalem; Department of Computer Science, The University of Hong Kong. {hapeng,npappas,nasmith}@cs.washington.edu, dyogatama@google.com, roys@cs.huji.ac.il, lpk@cs.hku.hk |
| Pseudocode | Yes | Algorithms 1 and 2 describe the computation procedures for causal and cross random feature attention (an illustrative sketch follows this table). |
| Open Source Code | No | The paper does not provide an explicit statement or a link to its own open-source code for the described methodology. |
| Open Datasets | Yes | We experiment with WikiText-103 (Merity et al., 2017). It is based on English Wikipedia. WMT14 EN-DE and EN-FR (Bojar et al., 2014). Our data split and preprocessing follow those of Vaswani et al. (2017). IWSLT14 DE-EN (Cettolo et al., 2014) is based on TED talks. We further evaluate RFA's accuracy and efficiency when used as text encoders on three NLP tasks from the recently proposed Long Range Arena benchmark (Tay et al., 2021), designed to evaluate efficient Transformer variants on tasks that require processing long sequences: ListOps (LO; Nangia & Bowman, 2018); character-level text classification with the IMDb movie review dataset (Maas et al., 2011); and character-level document retrieval with the ACL Anthology Network (AAN; Radev et al., 2009) dataset. |
| Dataset Splits | Yes | All models are trained for up to 150K gradient steps using the Adam optimizer (Kingma & Ba, 2015). No ℓ2-regularization is used. Early stopping is applied based on development set perplexity (language modeling) and development set BLEU (machine translation). |
| Hardware Specification | Yes | All models are trained using 16 TPU v3 accelerators and tested on a single TPU v2 accelerator; decoding uses greedy search with batch size 16. |
| Software Dependencies | No | The paper mentions that its implementation is based on JAX and uses the Adam optimizer, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | All models use a block size of 512 during both training and evaluation, i.e., they read as input a segment of 512 consecutive tokens, without access to the context from previous mini-batches. RFA variants use 64-dimensional random feature maps. We experiment with two model size settings, small (around 38M parameters) and big (around 242M parameters). Hyperparameters are listed in Table 7 (language modeling) and Table 8 (machine translation). |
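
The Pseudocode row above refers to the paper's Algorithms 1 and 2, which compute causal and cross random feature attention. Below is a minimal JAX sketch of both, assuming the paper's sinusoidal random feature map and ℓ2-normalized queries and keys; the function and variable names (`random_feature_map`, `cross_rfa`, `causal_rfa`, `w`) are illustrative and not taken from the authors' code.

```python
# A hedged sketch of random feature attention (RFA), not the authors' implementation.
import jax
import jax.numpy as jnp


def random_feature_map(x, w):
    """phi(x) = sqrt(1/D) [sin(w_1.x), ..., sin(w_D.x), cos(w_1.x), ..., cos(w_D.x)],
    with random projections w_i ~ N(0, sigma^2 I). Output dimension is 2*D."""
    num_proj = w.shape[0]                     # D random projections
    proj = x @ w.T                            # (..., D)
    return jnp.concatenate([jnp.sin(proj), jnp.cos(proj)], axis=-1) / jnp.sqrt(num_proj)


def cross_rfa(q, k, v, w, eps=1e-6):
    """Linear-time approximation of softmax cross-attention (in the spirit of Algorithm 2).
    q: (tgt_len, d); k, v: (src_len, d); w: (D, d)."""
    # The paper l2-normalizes queries and keys before applying the feature map.
    q = q / jnp.linalg.norm(q, axis=-1, keepdims=True)
    k = k / jnp.linalg.norm(k, axis=-1, keepdims=True)
    phi_q = random_feature_map(q, w)          # (tgt_len, 2D)
    phi_k = random_feature_map(k, w)          # (src_len, 2D)
    s = phi_k.T @ v                           # (2D, d): sum_i phi(k_i) v_i^T
    z = phi_k.sum(axis=0)                     # (2D,):   sum_i phi(k_i)
    num = phi_q @ s                           # (tgt_len, d)
    den = phi_q @ z                           # (tgt_len,)
    return num / (den[:, None] + eps)


def causal_rfa(q, k, v, w, eps=1e-6):
    """Causal variant (in the spirit of Algorithm 1): position t attends to positions <= t.
    Written with cumulative sums for clarity; the paper uses a constant-memory recurrence."""
    q = q / jnp.linalg.norm(q, axis=-1, keepdims=True)
    k = k / jnp.linalg.norm(k, axis=-1, keepdims=True)
    phi_q = random_feature_map(q, w)                        # (T, 2D)
    phi_k = random_feature_map(k, w)                        # (T, 2D)
    s = jnp.cumsum(phi_k[:, :, None] * v[:, None, :], 0)    # (T, 2D, d): running sum of phi(k_t) v_t^T
    z = jnp.cumsum(phi_k, axis=0)                           # (T, 2D):    running sum of phi(k_t)
    num = jnp.einsum('tf,tfd->td', phi_q, s)                # (T, d)
    den = jnp.einsum('tf,tf->t', phi_q, z)                  # (T,)
    return num / (den[:, None] + eps)


# Example usage. The experiments report 64-dimensional random feature maps; here we
# read that as phi having 2*D = 64 dimensions, i.e. D = 32 projections (an assumption).
d_head, num_proj = 64, 32
w = jax.random.normal(jax.random.PRNGKey(0), (num_proj, d_head))  # sigma = 1 for simplicity
q = jax.random.normal(jax.random.PRNGKey(1), (10, d_head))
k = jax.random.normal(jax.random.PRNGKey(2), (12, d_head))
v = jax.random.normal(jax.random.PRNGKey(3), (12, d_head))
out_cross = cross_rfa(q, k, v, w)                    # (10, 64)
out_causal = causal_rfa(q[:10], k[:10], v[:10], w)   # (10, 64)
```

The causal version above materializes the prefix sums for readability; the paper's Algorithm 1 instead carries the running statistics `s` and `z` across time steps, which is what gives RFA its constant-memory recurrent form.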