Proxyformer: Nyström-Based Linear Transformer with Trainable Proxy Tokens

Authors: Sangho Lee, Hayun Lee, Dongkun Shin

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To compare the performance of Proxyformer with those of other efficient transformer models, we carried out experiments using the Long Range Arena (LRA) benchmark (Tay et al. 2021).
Researcher Affiliation | Academia | Sangho Lee, Hayun Lee, Dongkun Shin* (Sungkyunkwan University) ilena7440@skku.edu, lhy920806@skku.edu, dongkun@skku.edu
Pseudocode | Yes | Algorithm 1: Nyströmformer attention
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | To compare the performance of Proxyformer with those of other efficient transformer models, we carried out experiments using the Long Range Arena (LRA) benchmark (Tay et al. 2021).
Dataset Splits | No | The paper uses the Long Range Arena (LRA) benchmark but does not explicitly provide the specific training, validation, or test dataset splits, or a citation that clearly defines them.
Hardware Specification | Yes | We recorded the memory usage per sequence and throughput on a single NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The paper mentions "Nyströmformer's LRA PyTorch implementation" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We used Nyströmformer's LRA PyTorch implementation, which employs a two-layer transformer model with 64 embedding dimensions, 128 feed-forward dimensions, and two attention heads. To ensure similar computational complexity across all variants, we set the projection dimension (e.g., # of proxy tokens and # of landmarks) to 128 for all projection-based variants. For Reformer and BigBird, we used 2-hashing functions and a block size of 64, respectively. The temperature parameter for contrastive loss and the dropout probability were set to 0.07 and 0.1, respectively.
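The Pseudocode row refers to the paper's Algorithm 1, Nyströmformer attention, which approximates full softmax attention through a small set of landmark tokens. Below is a minimal NumPy sketch of that general technique, not the paper's exact algorithm: it assumes segment-mean landmarks and uses an exact Moore–Penrose pseudoinverse where Nyströmformer uses an iterative approximation; all function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=4):
    """Nyström-style approximation of softmax(Q K^T / sqrt(d)) V.

    Q, K, V: (n, d) arrays; m landmarks (n must be divisible by m).
    """
    n, d = Q.shape
    # Landmarks as segment means of queries and keys (one common choice).
    Qm = Q.reshape(m, n // m, d).mean(axis=1)  # (m, d)
    Km = K.reshape(m, n // m, d).mean(axis=1)  # (m, d)
    scale = np.sqrt(d)
    F1 = softmax(Q @ Km.T / scale)   # (n, m): queries vs. key landmarks
    F2 = softmax(Qm @ Km.T / scale)  # (m, m): landmark-landmark kernel
    F3 = softmax(Qm @ K.T / scale)   # (m, n): query landmarks vs. keys
    # Exact pseudoinverse here; the paper's algorithm iterates instead.
    return F1 @ np.linalg.pinv(F2) @ (F3 @ V)  # (n, d)
```

Because only the three small factor matrices are materialized, memory and compute scale linearly in n for fixed m, which is the property the LRA efficiency comparisons measure.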
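The Experiment Setup row lists concrete hyperparameters; collecting them in one place, the quoted setup could be sketched as the following config, where the class and field names are hypothetical (the paper does not define such a structure):

```python
from dataclasses import dataclass

@dataclass
class LRASetup:
    # Values quoted from the paper's LRA setup; names are illustrative.
    num_layers: int = 2                    # two-layer transformer
    embed_dim: int = 64                    # embedding dimensions
    ffn_dim: int = 128                     # feed-forward dimensions
    num_heads: int = 2                     # attention heads
    projection_dim: int = 128              # # of proxy tokens / landmarks
    reformer_num_hashes: int = 2           # Reformer: 2-hashing functions
    bigbird_block_size: int = 64           # BigBird block size
    contrastive_temperature: float = 0.07  # temperature for contrastive loss
    dropout: float = 0.1                   # dropout probability
```

Setting `projection_dim` identically for all projection-based variants is what keeps their computational complexity comparable in the benchmark.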