QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion
Authors: Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the efficiency and effectiveness of the proposed QT-ViTs, showcasing state-of-the-art results. Particularly, the proposed QT-ViTs consistently surpass the previous SOTA EfficientViTs under different model sizes, and achieve a new Pareto front in terms of accuracy and speed. |
| Researcher Affiliation | Industry | Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum Advanced Micro Devices, Inc., Beijing, China |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block is present in the paper. |
| Open Source Code | No | Answer: [No] Justification: We do not include code. |
| Open Datasets | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. We conduct experiments on the COCO 2017 dataset to further validate the effectiveness of the proposed QT-ViT models. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset. |
| Dataset Splits | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. The COCO 2017 dataset has 118K training images, 5K validation images and 20K test-dev images. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset, which contains 20K training images from 150 semantic categories, 2K validation images and 3K test-dev images. |
| Hardware Specification | Yes | Latencies are evaluated on the AMD Instinct MI250 GPU. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., specific Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | We utilize the model architecture proposed in EfficientViT [3] and replace the kernel function with our proposed compact quadratic Taylor expansion kernel. An absolute positional embedding is added to the key matrix before applying linear attention, and a non-linear shortcut o = o + GELU(BN(v)) is added to the output of the linear attention o where v is the value matrix. Different exponential moving average (EMA) decay parameters are used, and all the other training settings and hyper-parameters remain the same. |
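Since the paper releases no code, the setup row above is the only description of the method. The sketch below is a minimal PyTorch reconstruction of second-order Taylor linear attention with the o = o + GELU(BN(v)) shortcut, based solely on that description: the outer-product feature map realizing 1 + q·k + (q·k)²/2, the tensor shapes, and the module name are my assumptions, not the paper's exact (compact) formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def taylor_feature_map(x):
    """Feature map phi such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2.

    The quadratic term uses the outer-product trick: flattening q_i * q_j
    (scaled by 1/sqrt(2)) makes the dot product of two such maps equal to
    (q.k)^2 / 2. Input (..., d) -> output (..., 1 + d + d*d).
    """
    ones = torch.ones_like(x[..., :1])
    quad = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    return torch.cat([ones, x, quad], dim=-1)


def linear_attention(q, k, v):
    """Linear attention with the quadratic Taylor kernel.

    q, k: (B, N, d); v: (B, N, dv). Cost is linear in N because keys and
    values are aggregated once before being queried.
    """
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.einsum('bnf,bnd->bfd', fk, v)            # aggregate k-v once
    z = fk.sum(dim=1)                                    # normalizer terms
    num = torch.einsum('bnf,bfd->bnd', fq, kv)
    den = torch.einsum('bnf,bf->bn', fq, z).unsqueeze(-1)
    return num / (den + 1e-6)


class TaylorLinearAttention(nn.Module):
    """Attention output plus the paper's non-linear shortcut GELU(BN(v))."""

    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)  # BN over channels of the value matrix

    def forward(self, q, k, v):
        o = linear_attention(q, k, v)
        # o = o + GELU(BN(v)), with BN applied channel-wise: (B, N, d) -> (B, d, N)
        shortcut = F.gelu(self.bn(v.transpose(1, 2)).transpose(1, 2))
        return o + shortcut
```

Because the kernel 1 + t + t²/2 is bounded below by 1/2, the normalizer stays positive, so no softmax-style exponentiation is needed; the output is identical to explicit attention with that polynomial kernel, at linear rather than quadratic cost in sequence length.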