KDEformer: Accelerating Transformers via Kernel Density Estimation

Authors: Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models. On BigGAN image generation, we achieve better generative scores than the exact computation with over 4× speedup. For ImageNet classification with T2T-ViT, KDEformer shows over 18× speedup while the accuracy drop is less than 0.5%.
Researcher Affiliation | Academia | Max-Planck-Institut für Informatik, Germany; Yale University, USA; New York University, USA.
Pseudocode | Yes | Algorithm 1 (KDEformer), Algorithm 2 (Weighted Exponential KDE, WExpKDE), Algorithm 3 (Practical Improvement of KDEformer)
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We randomly select a pair of matrices Q, V ∈ ℝ^{n×d} from the GloVe word embeddings (Pennington et al., 2014)... We use the pre-trained BigGAN on ImageNet... ImageNet classification with Vision Transformer... Long Range Arena benchmark (Tay et al., 2021)...
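For context, the quantity KDEformer approximates in these benchmarks is exact softmax attention. Below is a minimal NumPy sketch of that exact baseline (illustrative only, not the paper's code; the random matrices are a stand-in for the GloVe-derived Q, V pair mentioned above):

```python
import numpy as np

def exact_attention(Q, K, V):
    """Exact softmax attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are (n, d) arrays. A row-wise max is subtracted before
    exponentiation for numerical stability; KDEformer's goal is to
    approximate this output without forming the (n, n) matrix.
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                # (n, n) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic attention
    return A @ V

# Toy stand-in for the embedding matrices used in the benchmark.
rng = np.random.default_rng(0)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = exact_attention(Q, K, V)   # (8, 4) attention output
```

Approximation quality can then be measured as the relative error between this exact output and an approximate method's output on the same inputs.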
Dataset Splits | Yes | We generate 5,000 fake images and compute the Fréchet Inception Distance (FID) with the ImageNet validation set as ground truth... We compute top-1 accuracy on the ImageNet validation dataset
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | The model is a 2-layer transformer with 64 embedding dimension, 128 hidden dimension, 2 attention heads, and mean pooling is used for the classification task. Learning rate is set to 10⁻⁴ for Text, ListOps, Image and 2×10⁻⁴ for the rest. All models are trained for 50,000 steps. Similar to Section 4.1, we choose hyperparameters of all methods having equal feature dimension of 128.
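The LRA training configuration quoted above can be collected into a small config sketch. This is an assumption-laden illustration (the dict names, the `config_for` helper, and the listing of the remaining LRA tasks as Retrieval and Pathfinder are ours, not the paper's code):

```python
# Shared model/training hyperparameters quoted in the review above.
BASE_CONFIG = {
    "num_layers": 2,        # 2-layer transformer
    "embed_dim": 64,        # embedding dimension
    "hidden_dim": 128,      # feed-forward hidden dimension
    "num_heads": 2,         # attention heads
    "pooling": "mean",      # mean pooling for classification
    "train_steps": 50_000,  # all models trained for 50,000 steps
    "feature_dim": 128,     # equal feature dimension across all methods
}

# Per-task learning rates: 1e-4 for Text, ListOps, Image; 2e-4 for the rest.
# Retrieval and Pathfinder are the standard remaining LRA tasks (our reading
# of "the rest").
LEARNING_RATES = {
    "Text": 1e-4, "ListOps": 1e-4, "Image": 1e-4,
    "Retrieval": 2e-4, "Pathfinder": 2e-4,
}

def config_for(task):
    """Return the full hyperparameter dict for one LRA task."""
    return {**BASE_CONFIG, "lr": LEARNING_RATES[task]}
```

A training script would then build the model and optimizer from `config_for(task)`, keeping the attention-method comparison fair by holding everything except the attention mechanism fixed.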