Rethinking Attention with Performers

Authors: Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
Researcher Affiliation | Collaboration | 1Google, 2University of Cambridge, 3DeepMind, 4Alan Turing Institute
Pseudocode | Yes | The pseudocode of the entire FAVOR+ algorithm is given in Appendix B. (A hedged sketch of the computation FAVOR+ performs appears after this table.)
Open Source Code | Yes | Code for Transformer models on protein data can be found in github.com/google-research/google-research/tree/master/protein_lm and Performer code can be found in github.com/google-research/google-research/tree/master/performer.
Open Datasets | Yes | We used the TrEMBL dataset (footnote 4: https://www.uniprot.org/statistics/TrEMBL), which contains 139,394,261 sequences of which 106,030,080 are unique. ...PG-19 (Rae et al., 2020)... ImageNet64 benchmark from (Parmar et al., 2018)...
Dataset Splits | Yes | The original dataset token count per split is: train=1973136207, validation=3007061, test=6966499.
Hardware Specification | Yes | ...on a V100 GPU with 16GB. ...all TrEMBL experiments used 16x16 TPU-v2s. ...Benchmarking is run on 4x4 TPU-v3 chips.
Software Dependencies | No | The paper states 'We implemented our setup on top of pre-existing Transformer training code in Jax (Frostig et al., 2018) optimized with just-in-time (jax.jit) compilation', but it does not specify version numbers for Jax or any other software libraries or dependencies.
Experiment Setup | Yes | Unless specifically stated, all Performer + Transformer runs by default used 0.5 grad clip, 0.1 weight decay, 0.1 dropout, 10⁻³ fixed learning rate with Adam hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹), with batch size maximized (until TPU memory overload) for a specific model. (An illustrative optimizer-configuration sketch appears after the table.)
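For orientation, the block below is a minimal sketch of the kind of computation the FAVOR+ pseudocode in Appendix B describes: softmax attention approximated with positive random features, so that attention runs in time linear in sequence length. It is written in JAX because the paper reports a Jax implementation, but the function names (positive_random_features, favor_attention), the shapes, and the plain Gaussian projection are illustrative assumptions, not the authors' released code.

```python
# Sketch of FAVOR+-style attention with positive random features.
# Illustrative only: names and shapes are assumptions, not the paper's
# Appendix B pseudocode or the released Performer implementation.
import jax
import jax.numpy as jnp


def positive_random_features(x, projection, eps=1e-6):
    """Map [L, d] queries or keys to positive random features of shape [L, m]."""
    d = x.shape[-1]
    x = x / (d ** 0.25)                                 # so q.k ends up scaled by 1/sqrt(d)
    wx = x @ projection.T                               # [L, m] random projections
    sq_norm = 0.5 * jnp.sum(x * x, axis=-1, keepdims=True)
    m = projection.shape[0]
    return jnp.exp(wx - sq_norm) / jnp.sqrt(m) + eps    # strictly positive features


@jax.jit
def favor_attention(q, k, v, projection):
    """Linear-time attention: D^-1 (Q' ((K')^T V)) instead of softmax(Q K^T) V."""
    q_prime = positive_random_features(q, projection)   # [L, m]
    k_prime = positive_random_features(k, projection)   # [L, m]
    kv = k_prime.T @ v                                   # [m, d_v]
    normalizer = q_prime @ jnp.sum(k_prime, axis=0)      # [L] row sums of the kernel estimate
    return (q_prime @ kv) / normalizer[:, None]          # [L, d_v]


# Toy usage: bidirectional (non-causal) attention on one sequence.
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, 1024, 64))
projection = jax.random.normal(jax.random.PRNGKey(1), (256, 64))  # m = 256 features
out = favor_attention(q, k, v, projection)               # shape [1024, 64]
```

The released Performer code additionally uses orthogonal random features and periodic feature redrawing to reduce estimator variance; this sketch omits both.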
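The default training setup quoted above can also be read as an optimizer configuration. The snippet below is a sketch assuming the Optax library; the paper lists only the hyperparameter values and does not say which optimizer library was used or whether weight decay was decoupled (AdamW-style) or applied as an L2 penalty.

```python
# Sketch of the reported defaults as an Optax optimizer chain (assumed library).
# Reported values: 0.5 grad clip, 0.1 weight decay, 0.1 dropout,
# 1e-3 fixed learning rate, Adam beta1=0.9, beta2=0.98, eps=1e-9.
import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(0.5),   # "0.5 grad clip"
    optax.adamw(                      # Adam with decoupled weight decay (an assumption)
        learning_rate=1e-3,           # fixed learning rate 10^-3
        b1=0.9,
        b2=0.98,
        eps=1e-9,
        weight_decay=0.1,
    ),
)
# Dropout (rate 0.1) is applied inside the model, not in the optimizer.
```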