Proxyformer: Nyström-Based Linear Transformer with Trainable Proxy Tokens
Authors: Sangho Lee, Hayun Lee, Dongkun Shin
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments: To compare the performance of Proxyformer with that of other efficient transformer models, we carried out experiments using the Long Range Arena (LRA) benchmark (Tay et al. 2021). |
| Researcher Affiliation | Academia | Sangho Lee, Hayun Lee, Dongkun Shin* Sungkyunkwan University ilena7440@skku.edu, lhy920806@skku.edu, dongkun@skku.edu |
| Pseudocode | Yes | Algorithm 1 Nyströmformer attention (a hedged sketch of this attention scheme appears after the table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | To compare the performance of Proxyformer with that of other efficient transformer models, we carried out experiments using the Long Range Arena (LRA) benchmark (Tay et al. 2021). |
| Dataset Splits | No | The paper uses the Long Range Arena (LRA) benchmark but does not explicitly provide the specific training, validation, or test dataset splits or a citation that clearly defines them. |
| Hardware Specification | Yes | We recorded the memory usage per sequence and throughput on a single NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions "Nyströmformer's LRA PyTorch implementation" but does not specify version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We used Nyströmformer's LRA PyTorch implementation, which employs a two-layer transformer model with 64 embedding dimensions, 128 feed-forward dimensions, and two attention heads. To ensure similar computational complexity across all variants, we set the projection dimension (e.g., # of proxy tokens and # of landmarks) to 128 for all projection-based variants. For Reformer and Bigbird, we used 2 hashing functions and a block size of 64, respectively. The temperature parameter for contrastive loss and dropout probability were set to 0.07 and 0.1, respectively. (These values are collected in the configuration sketch after the table.) |
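The "Pseudocode" row points to Algorithm 1 (Nyströmformer attention), which Proxyformer builds on by replacing fixed landmarks with trainable proxy tokens. For orientation only, below is a minimal PyTorch sketch of the Nyström approximation. It is not the authors' code: the function name `nystrom_attention` and the segment-mean landmark choice are our assumptions, and the original algorithm uses an iterative Moore-Penrose approximation where this sketch calls `torch.linalg.pinv`.

```python
import torch
import torch.nn.functional as F

def nystrom_attention(q, k, v, num_landmarks=128):
    """Nystrom-approximated softmax attention (illustrative sketch).

    q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed to be
    divisible by num_landmarks for the segment-mean landmark selection.
    """
    b, h, n, d = q.shape
    scale = d ** -0.5

    # Landmarks as segment means over the sequence (the Nystromformer
    # choice; Proxyformer instead learns these as proxy tokens).
    q_land = q.reshape(b, h, num_landmarks, n // num_landmarks, d).mean(dim=-2)
    k_land = k.reshape(b, h, num_landmarks, n // num_landmarks, d).mean(dim=-2)

    # Three small softmax kernels replace the full n-by-n attention map.
    k1 = F.softmax(q @ k_land.transpose(-1, -2) * scale, dim=-1)       # (n, m)
    k2 = F.softmax(q_land @ k_land.transpose(-1, -2) * scale, dim=-1)  # (m, m)
    k3 = F.softmax(q_land @ k.transpose(-1, -2) * scale, dim=-1)       # (m, n)

    # softmax(QK^T / sqrt(d)) V  ~=  k1 @ pinv(k2) @ (k3 @ V); linear in n
    # because no n-by-n attention matrix is ever materialized.
    return k1 @ torch.linalg.pinv(k2) @ (k3 @ v)
```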
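For the "Experiment Setup" row, the quoted hyperparameters can be collected into a single configuration. This is a hypothetical container whose field names are ours, not the authors'; only the values come from the paper.

```python
# Hyperparameters quoted from the paper's LRA setup; field names are assumptions.
lra_config = dict(
    num_layers=2,                 # two-layer transformer
    embed_dim=64,                 # embedding dimensions
    ffn_dim=128,                  # feed-forward dimensions
    num_heads=2,                  # attention heads
    projection_dim=128,           # proxy tokens / landmarks (projection-based variants)
    reformer_num_hashes=2,        # Reformer baseline
    bigbird_block_size=64,        # BigBird baseline
    contrastive_temperature=0.07, # temperature for the contrastive loss
    dropout=0.1,
)
```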