Random Feature Attention

Authors: Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
Researcher Affiliation | Collaboration | Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong; Paul G. Allen School of Computer Science & Engineering, University of Washington; DeepMind; Allen Institute for Artificial Intelligence; School of Computer Science & Engineering, Hebrew University of Jerusalem; Department of Computer Science, The University of Hong Kong; {hapeng,npappas,nasmith}@cs.washington.edu, dyogatama@google.com, roys@cs.huji.ac.il, lpk@cs.hku.hk
Pseudocode | Yes | Algorithms 1 and 2 describe the computation procedures of causal and cross random feature attention (a minimal illustrative sketch of both variants appears after this table).
Open Source Code | No | The paper does not provide an explicit statement or a link to its own open-source code for the described methodology.
Open Datasets | Yes | We experiment with WikiText-103 (Merity et al., 2017), which is based on English Wikipedia. WMT14 EN-DE and EN-FR (Bojar et al., 2014): our data split and preprocessing follow those of Vaswani et al. (2017). IWSLT14 DE-EN (Cettolo et al., 2014) is based on TED talks. We further evaluate RFA's accuracy and efficiency when used as text encoders on three NLP tasks from the recently proposed Long Range Arena benchmark (Tay et al., 2021), designed to evaluate efficient Transformer variants on tasks that require processing long sequences: ListOps (LO; Nangia & Bowman, 2018); character-level text classification with the IMDb movie review dataset (Maas et al., 2011); and character-level document retrieval with the ACL Anthology Network (AAN; Radev et al., 2009) dataset.
Dataset Splits | Yes | All models are trained for up to 150K gradient steps using the Adam optimizer (Kingma & Ba, 2015). No ℓ2-regularization is used. Early stopping is applied based on development-set perplexity for language modeling and development-set BLEU for machine translation.
Hardware Specification | Yes | All models are trained using 16 TPU v3 accelerators and tested on a single TPU v2 accelerator, with greedy decoding and batch size 16 at test time.
Software Dependencies | No | The paper mentions that its implementation is based on JAX and uses the Adam optimizer, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | All models use a 512 block size during both training and evaluation, i.e., they read as input a segment of 512 consecutive tokens, without access to the context from previous mini-batches. RFA variants use 64-dimensional random feature maps. We experiment with two model size settings: small (around 38M parameters) and big (around 242M parameters). Tables 7 and 8 list the hyperparameters used in the language modeling and machine translation experiments, respectively (an illustrative configuration sketch also appears below).
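
The Pseudocode row above refers to the paper's Algorithm 1 (causal RFA) and Algorithm 2 (cross RFA). As a rough illustration only, a minimal JAX sketch of both variants under the paper's sinusoidal random feature map might look as follows; the function names are mine, and the query/key normalization and optional gating mechanism discussed in the paper are omitted for brevity, so this should not be read as the authors' implementation.

```python
# Minimal JAX sketch of random feature attention (RFA), loosely following the
# paper's Algorithm 1 (causal) and Algorithm 2 (cross). Names and shapes are
# illustrative; query/key normalization and gating are intentionally omitted.
import jax
import jax.numpy as jnp


def random_feature_map(x, w):
    """phi(x) = 1/sqrt(D) * [sin(x W^T), cos(x W^T)]; rows of W drawn from N(0, sigma^2 I)."""
    proj = x @ w.T                                        # [..., D]
    return jnp.concatenate([jnp.sin(proj), jnp.cos(proj)], axis=-1) / jnp.sqrt(w.shape[0])


def cross_rfa(q, k, v, w):
    """Cross attention: summarize keys/values once, then attend with every query."""
    phi_q = random_feature_map(q, w)                      # [tgt_len, 2D]
    phi_k = random_feature_map(k, w)                      # [src_len, 2D]
    s = phi_k.T @ v                                       # sum_i phi(k_i) v_i^T -> [2D, d_v]
    z = phi_k.sum(axis=0)                                 # sum_i phi(k_i)       -> [2D]
    return (phi_q @ s) / (phi_q @ z)[:, None]             # [tgt_len, d_v]


def causal_rfa(q, k, v, w):
    """Causal attention as a linear-time recurrence over running sums S_t and z_t."""
    phi_q = random_feature_map(q, w)
    phi_k = random_feature_map(k, w)

    def step(carry, inputs):
        s, z = carry
        pq, pk, vt = inputs
        s = s + jnp.outer(pk, vt)                         # S_t = S_{t-1} + phi(k_t) v_t^T
        z = z + pk                                        # z_t = z_{t-1} + phi(k_t)
        out = (pq @ s) / (pq @ z)                         # phi(q_t)^T S_t / (phi(q_t) . z_t)
        return (s, z), out

    init = (jnp.zeros((phi_k.shape[-1], v.shape[-1])), jnp.zeros(phi_k.shape[-1]))
    _, outputs = jax.lax.scan(step, init, (phi_q, phi_k, v))
    return outputs                                        # [seq_len, d_v]
```

Both variants run in time and memory linear in the sequence length, which is the efficiency property probed by the Long Range Arena experiments listed above.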
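
For the experiment setup and training details quoted in the table, an illustrative language-modeling configuration could be assembled as below; the field names are hypothetical, and anything not quoted in this section (e.g., the learning rate) is deliberately left unset rather than copied from the paper's Tables 7 and 8.

```python
# Illustrative configuration built only from the settings quoted in the table
# above; field names are placeholders, unset values are not stated in this section.
lm_config = {
    "block_size": 512,                   # 512 consecutive tokens per segment, train and eval
    "random_feature_dim": 64,            # dimensionality of the random feature maps
    "model_size": "small",               # "small" (~38M params) or "big" (~242M params)
    "optimizer": "adam",                 # Adam (Kingma & Ba, 2015)
    "max_gradient_steps": 150_000,       # trained for up to 150K gradient steps
    "l2_regularization": None,           # no l2-regularization is used
    "early_stopping_metric": "dev_ppl",  # dev BLEU instead for machine translation
    "learning_rate": None,               # not quoted here; see the paper's Table 7
}
```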