QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion

Authors: Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the efficiency and effectiveness of the proposed QT-ViTs, showcasing the state-of-the-art results. Particularly, the proposed QT-ViTs consistently surpass the previous SOTA EfficientViTs under different model sizes, and achieve a new Pareto-front in terms of accuracy and speed.
Researcher Affiliation | Industry | Yixing Xu, Chao Li, Dong Li, Xiao Sheng, Fan Jiang, Lu Tian, Emad Barsoum (Advanced Micro Devices, Inc., Beijing, China)
Pseudocode | No | No pseudocode or clearly labeled algorithm block is present in the paper.
Open Source Code | No | Answer: [No] Justification: We do not include code.
Open Datasets | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. We conduct experiments on the COCO 2017 dataset to further validate the effectiveness of the proposed QT-ViT models. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset.
Dataset Splits | Yes | The ImageNet-1k classification dataset is used for training and evaluation, which contains 1.28M training images and 50K validation images from 1000 different classes. The COCO 2017 dataset has 118K training images, 5K validation images and 20K test-dev images. We further verify the effectiveness of the proposed QT-ViT on the semantic segmentation task using the ADE20K dataset, which contains 20K training images from 150 semantic categories, 2K validation images and 3K test-dev images.
Hardware Specification | Yes | Latencies are evaluated on the AMD Instinct MI250 GPU.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., specific Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We utilize the model architecture proposed in EfficientViT [3] and replace the kernel function with our proposed compact quadratic Taylor expansion kernel. An absolute positional embedding is added to the key matrix before applying linear attention, and a non-linear shortcut o = o + GELU(BN(v)) is added to the output of the linear attention o, where v is the value matrix. Different exponential moving average (EMA) decay parameters are used, and all the other training settings and hyper-parameters remain the same.
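To make the core idea concrete, here is a minimal NumPy sketch of linear attention with a second-order Taylor kernel. It is not the authors' implementation (their compact kernel, positional embedding, and GELU/BN shortcut are omitted); it only illustrates the standard decomposition exp(q·k) ≈ 1 + q·k + (q·k)²/2, which factors into a feature map φ so attention can be computed without the n×n score matrix. All function names below are hypothetical.

```python
import numpy as np

def taylor_feature_map(x):
    """Map each row x -> [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2 (2nd-order Taylor of exp(q.k))."""
    n, d = x.shape
    outer = np.einsum('ni,nj->nij', x, x).reshape(n, d * d)
    return np.concatenate([np.ones((n, 1)), x, outer / np.sqrt(2)], axis=1)

def taylor_linear_attention(q, k, v):
    """Linear attention via the feature map: keys/values are summarized once,
    so cost is O(n * d'^2) instead of the O(n^2) softmax score matrix."""
    phi_q, phi_k = taylor_feature_map(q), taylor_feature_map(k)
    kv = phi_k.T @ v                 # (d', dv) summary of keys and values
    z = phi_k.sum(axis=0)            # (d',) normalizer accumulator
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Sanity check: matches explicit attention with Taylor-approximated weights.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))
scores = 1 + q @ k.T + (q @ k.T) ** 2 / 2       # always positive
expected = (scores / scores.sum(axis=1, keepdims=True)) @ v
assert np.allclose(taylor_linear_attention(q, k, v), expected)
```

Note that 1 + x + x²/2 > 0 for all real x, so the approximated attention weights stay positive and the normalization is well defined, which is one reason the quadratic truncation is a convenient kernel choice.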