Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Rethinking Attention with Performers

Authors: Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
Researcher Affiliation | Collaboration | 1 Google, 2 University of Cambridge, 3 DeepMind, 4 Alan Turing Institute
Pseudocode | Yes | The pseudocode of the entire FAVOR+ algorithm is given in Appendix B.
Open Source Code | Yes | Code for Transformer models on protein data can be found in github.com/google-research/google-research/tree/master/protein_lm and Performer code can be found in github.com/google-research/google-research/tree/master/performer.
Open Datasets | Yes | We used the TrEMBL dataset⁴, which contains 139,394,261 sequences of which 106,030,080 are unique. (Footnote 4: https://www.uniprot.org/statistics/TrEMBL) ...PG-19 (Rae et al., 2020)... ImageNet64 benchmark from (Parmar et al., 2018)...
Dataset Splits | Yes | The original dataset token count per split is: train=1973136207, validation=3007061, test=6966499.
Hardware Specification | Yes | on a V100 GPU with 16GB. ...all TrEMBL experiments used 16x16 TPU-v2s. ...Benchmarking is run on 4x4 TPU-v3 chips.
Software Dependencies | No | The paper states 'We implemented our setup on top of pre-existing Transformer training code in Jax (Frostig et al., 2018) optimized with just-in-time (jax.jit) compilation', but it does not specify version numbers for Jax or any other software libraries or dependencies.
Experiment Setup | Yes | Unless specifically stated, all Performer + Transformer runs by default used 0.5 grad clip, 0.1 weight decay, 0.1 dropout, 10^-3 fixed learning rate with Adam hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 10^-9), with batch size maximized (until TPU memory overload) for a specific model.
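
The defaults quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only and is not the authors' Jax training code: the names `TrainConfig`, `clip_by_global_norm`, and `adam_step` are hypothetical, and two details the quoted text does not specify are assumed here, namely that "0.5 grad clip" means clipping by global L2 norm and that weight decay is applied in decoupled form. Only the numeric values come from the quoted setup.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Default hyperparameters as quoted in the Experiment Setup row."""
    grad_clip: float = 0.5       # clipping threshold (clipping style is an assumption)
    weight_decay: float = 0.1
    dropout: float = 0.1         # recorded for completeness; unused in this sketch
    learning_rate: float = 1e-3  # fixed learning rate, no schedule
    beta1: float = 0.9
    beta2: float = 0.98
    eps: float = 1e-9


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their joint L2 norm is at most max_norm (assumed style)."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]


def adam_step(params, grads, m, v, step, cfg):
    """One bias-corrected Adam update on flat lists of floats.

    step is 1-based. Weight decay is added to the update in decoupled
    form, which is an assumption rather than something the source states.
    """
    grads = clip_by_global_norm(grads, cfg.grad_clip)
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = cfg.beta1 * m_i + (1 - cfg.beta1) * g
        v_i = cfg.beta2 * v_i + (1 - cfg.beta2) * g * g
        m_hat = m_i / (1 - cfg.beta1 ** step)
        v_hat = v_i / (1 - cfg.beta2 ** step)
        update = m_hat / (math.sqrt(v_hat) + cfg.eps) + cfg.weight_decay * p
        new_params.append(p - cfg.learning_rate * update)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Calling `adam_step([1.0, -2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], 1, TrainConfig())` first clips the gradient from norm 5 down to norm 0.5 and then applies a single update. Batch size is deliberately left out of the configuration, since the quoted setup maximizes it per model rather than fixing a value.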