QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion

Authors: Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the efficiency and effectiveness of the proposed QT-ViTs, showcasing the state-of-the-art results. Particularly, the proposed QT-ViTs consistently surpass the previous SOTA EfficientViTs under different model sizes, and achieve a new Pareto-front in terms of accuracy and speed.
Researcher Affiliation | Industry | Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum (Advanced Micro Devices, Inc., Beijing, China)
Pseudocode | No | No pseudocode or clearly labeled algorithm block is present in the paper.
Open Source Code | No | Answer: [No] Justification: We do not include code.
Open Datasets | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. We conduct experiments on the COCO 2017 dataset to further validate the effectiveness of the proposed QT-ViT models. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset.
Dataset Splits | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. The COCO 2017 dataset has 118K training images, 5K validation images and 20K test-dev images. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset, which contains 20K training images from 150 semantic categories, 2K validation images and 3K test-dev images.
Hardware Specification | Yes | Latencies are evaluated on the AMD Instinct MI250 GPU.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., specific Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We utilize the model architecture proposed in EfficientViT [3] and replace the kernel function with our proposed compact quadratic Taylor expansion kernel. An absolute positional embedding is added to the key matrix before applying linear attention, and a non-linear shortcut o = o + GELU(BN(v)) is added to the output of the linear attention o, where v is the value matrix. Different exponential moving average (EMA) decay parameters are used, and all the other training settings and hyper-parameters remain the same.
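To make the core idea concrete, here is a minimal NumPy sketch of linear attention with a second-order Taylor kernel. It is not the authors' implementation (their compact kernel, positional embedding, and GELU/BN shortcut are omitted); it only illustrates the standard decomposition exp(q·k) ≈ 1 + q·k + (q·k)²/2, which factors into a feature map φ so attention can be computed without the n×n score matrix. All function names below are hypothetical.

```python
import numpy as np

def taylor_feature_map(x):
    """Map each row x -> [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2 (2nd-order Taylor of exp(q.k))."""
    n, d = x.shape
    outer = np.einsum('ni,nj->nij', x, x).reshape(n, d * d)
    return np.concatenate([np.ones((n, 1)), x, outer / np.sqrt(2)], axis=1)

def taylor_linear_attention(q, k, v):
    """Linear attention via the feature map: keys/values are summarized once,
    so cost is O(n * d'^2) instead of the O(n^2) softmax score matrix."""
    phi_q, phi_k = taylor_feature_map(q), taylor_feature_map(k)
    kv = phi_k.T @ v                 # (d', dv) summary of keys and values
    z = phi_k.sum(axis=0)            # (d',) normalizer accumulator
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Sanity check: matches explicit attention with Taylor-approximated weights.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))
scores = 1 + q @ k.T + (q @ k.T) ** 2 / 2       # always positive
expected = (scores / scores.sum(axis=1, keepdims=True)) @ v
assert np.allclose(taylor_linear_attention(q, k, v), expected)
```

Note that 1 + x + x²/2 > 0 for all real x, so the approximated attention weights stay positive and the normalization is well defined, which is one reason the quadratic truncation is a convenient kernel choice.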