KDEformer: Accelerating Transformers via Kernel Density Estimation
Authors: Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over 4× speedup. For ImageNet classification with T2T-ViT, KDEformer shows over 18× speedup while the accuracy drop is less than 0.5%. |
| Researcher Affiliation | Academia | 1 Max-Planck-Institut für Informatik, Germany; 2 Yale University, USA; 3 New York University, USA. |
| Pseudocode | Yes | Algorithm 1 KDEformer, Algorithm 2 Weighted Exponential KDE (WEXPKDE), Algorithm 3 Practical Improvement of KDEformer |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We randomly select a pair of matrices Q, V ∈ R^{n×d} from the GloVe word embeddings (Pennington et al., 2014)... We use the pre-trained BigGAN on ImageNet... ImageNet classification with Vision Transformer... Long Range Arena benchmark (Tay et al., 2021)... |
| Dataset Splits | Yes | We generate 5,000 fake images and compute the Fréchet Inception Distance (FID) with the ImageNet validation set as ground truth... We compute top-1 accuracy on the ImageNet validation dataset |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | model is a 2-layer transformer with 64 embedding dimension, 128 hidden dimension, 2 attention heads, and mean pooling is used for the classification task. Learning rate is set to 10^-4 for Text, ListOps, Image and 2×10^-4 for the rest. All models are trained for 50,000 steps. Similar to Section 4.1, we choose hyperparameters so that all methods have an equal feature dimension of 128. |
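For context on the connection the paper's title draws, the softmax normalizer of exact attention is itself an (unnormalized) exponential-kernel density estimate of each query over the keys; that is the quantity KDEformer estimates fast instead of computing exactly. The following NumPy sketch computes only the exact baseline, not the paper's algorithm, and all names in it are illustrative:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Exact attention: out = D^{-1} exp(QK^T) V, where the diagonal
    normalizer D_ii = sum_j exp(q_i . k_j) is an unnormalized
    exponential-kernel density estimate of q_i over the keys.
    This is the O(n^2) baseline that KDEformer approximates."""
    S = Q @ K.T
    S -= S.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(S)
    D = A.sum(axis=1, keepdims=True)   # the KDE-style row normalizer
    return (A / D) @ V                 # rows of A/D are convex weights

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = softmax_attention(Q, K, V)
```

The row-max subtraction leaves the ratio A/D unchanged, so the output matches the textbook softmax while avoiding overflow for large dot products.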