Chefs' Random Tables: Non-Trigonometric Random Features

Authors: Valerii Likhosherstov, Krzysztof M Choromanski, Kumar Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity of the attention. We present an extensive empirical evaluation of CRTs. Additional details and results for each experiment can be found in Appendix 9.10.
Researcher Affiliation | Collaboration | Valerii Likhosherstov* (University of Cambridge, vl304@cam.ac.uk); Krzysztof Choromanski* (Google Research & Columbia University, kchoro@google.com); Avinava Dubey* (Google Research); Frederick Liu* (Google Research); Tamas Sarlos (Google Research); Adrian Weller (University of Cambridge & The Alan Turing Institute)
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We include the part of the code that is not confidential, the core CRT variant: the FAVOR++ mechanism. (A simplified random-feature attention sketch is given after the table.)
Open Datasets | Yes | We evaluate on classification benchmarks from the UCI Repository [24] (Table 1)... General Language Understanding Evaluation (GLUE) benchmark [57]... LibriSpeech ASR corpus ([42])... ImageNet ([18]).
Dataset Splits | Yes | Hyperparameters are tuned on the validation set. For GLUE tasks, we use the standard splits: training and development splits from the BERT repository.
Hardware Specification | Yes | For GLUE training, we used 8x A100 GPUs.
Software Dependencies | No | All code is written in JAX/NumPy [6, 28]; no specific library versions are reported.
Experiment Setup | Yes | Batch size 128 (on 8 A100 GPUs, that is 16 examples per GPU). Learning rate 2e-4. We train for 10 epochs. We use the Adam optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e-6. (These settings are restated as a configuration sketch after the table.)
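
The "Open Source Code" row refers to the FAVOR++ mechanism, of which only the non-confidential part is released. As a rough illustration of how random-feature attention attains the linear space and time complexity mentioned under "Research Type", the following JAX sketch uses the simpler positive feature map exp(wᵀx - ||x||²/2) from the FAVOR+ family; the function names, the feature map, and the omission of FAVOR++'s optimised parameters are assumptions made for illustration, not the authors' code.

```python
# Hypothetical sketch, NOT the authors' released FAVOR++ implementation:
# positive random features in the FAVOR+ family, used here only to show
# how attention becomes linear in the sequence length L.
import jax
import jax.numpy as jnp


def positive_random_features(x, projection, eps=1e-6):
    """x: [L, d] queries or keys; projection: [m, d] Gaussian directions.

    Returns non-negative random features phi(x) of shape [L, m].
    """
    m = projection.shape[0]
    wx = x @ projection.T                                    # [L, m]
    sq_norm = 0.5 * jnp.sum(x ** 2, axis=-1, keepdims=True)  # [L, 1]
    return jnp.exp(wx - sq_norm) / jnp.sqrt(m) + eps


def linear_attention(q, k, v, projection):
    """Approximate softmax(Q K^T) V in O(L m d) time and O(m d) memory:
    compute phi(Q) @ (phi(K)^T V) instead of materialising the L x L
    attention matrix. The usual 1/sqrt(d) temperature is omitted."""
    q_prime = positive_random_features(q, projection)   # [L, m]
    k_prime = positive_random_features(k, projection)   # [L, m]
    kv = k_prime.T @ v                                   # [m, d_v]
    normaliser = q_prime @ jnp.sum(k_prime, axis=0)      # [L]
    return (q_prime @ kv) / normaliser[:, None]


key = jax.random.PRNGKey(0)
keys = jax.random.split(key, 4)
q = jax.random.normal(keys[0], (128, 64))
k = jax.random.normal(keys[1], (128, 64))
v = jax.random.normal(keys[2], (128, 64))
proj = jax.random.normal(keys[3], (256, 64))   # m = 256 random features
out = linear_attention(q, k, v, proj)          # [128, 64]
```

Swapping `positive_random_features` for the paper's optimised FAVOR++ feature map changes only the feature computation; the associativity trick `phi(Q) @ (phi(K)^T V)` is what removes the quadratic dependence on sequence length.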
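
The hyperparameters in the "Experiment Setup" row can be collected into a small configuration sketch. The use of optax and the variable names below are assumptions; only the numerical values come from the quoted text.

```python
# Hypothetical restatement of the reported GLUE training settings.
# optax and the variable names are assumptions; the numbers (batch size,
# learning rate, epochs, Adam betas and epsilon) come from the paper.
import optax

GLOBAL_BATCH_SIZE = 128          # 8 A100 GPUs -> 16 examples per GPU
NUM_DEVICES = 8
PER_DEVICE_BATCH_SIZE = GLOBAL_BATCH_SIZE // NUM_DEVICES  # 16
NUM_EPOCHS = 10

optimizer = optax.adam(learning_rate=2e-4, b1=0.9, b2=0.999, eps=1e-6)
```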