ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

Authors: Haoran You, Huihong Shi, Yipin Guo, Yingyan Lin

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18× latency reductions on GPUs and 42.9% energy savings, while maintaining a comparable accuracy as original or efficient ViTs.
Researcher Affiliation | Academia | Haoran You*, Huihong Shi*, Yipin Guo*, and Yingyan (Celine) Lin, Georgia Institute of Technology, Atlanta, GA. {haoran.you, celine.lin}@gatech.edu, eic-lab@groups.gatech.edu
Pseudocode | No | The paper uses figures (e.g., Figure 1, Figure 2) to illustrate the network architecture and operations but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes and models are available at https://github.com/GATECH-EIC/ShiftAddViT.
Open Datasets | Yes | We consider two representative 2D and 3D Transformer-based vision tasks to demonstrate the superiority of the proposed ShiftAddViT, including 2D image classification on the ImageNet dataset [15] with 1.2 million training and 50K validation images and the 3D novel view synthesis (NVS) task on the Local Light Field Fusion (LLFF) dataset [40] with eight scenes.
Dataset Splits | Yes | ...including 2D image classification on the ImageNet dataset [15] with 1.2 million training and 50K validation images.
Hardware Specification | Yes | All experiments are run on a server with eight RTX A5000 GPUs, each with 24GB of GPU memory.
Software Dependencies | No | The paper mentions PyTorch [42] and TVM [10] as software used for implementation and optimization, but specific version numbers for these or other software dependencies are not provided.
Experiment Setup | Yes | For the classification task, we follow Ecoformer [34] to initialize the pre-trained ViTs with Multi-head Self-Attention (MSA) weights, based on which we apply our reparameterization via a two-stage finetuning: (1) convert MSA to linear attention [73] and reparameterize all MatMuls with add layers with 100 epochs of finetuning, and (2) reparameterize MLPs or linear layers with shift or MoE layers after finetuning for another 100 epochs. ... All experiments are run on a server with eight RTX A5000 GPUs with a total batch size of 256, and we use the AdamW optimizer [37] with a cosine decay lr scheduler. ... For the NVS task, we still follow the two-stage finetuning but do not convert MSA weights to linear attention to maintain the accuracy. ... Both stages are finetuned for 140K steps with a base lr of 5×10⁻⁴, and we sample 2048 rays with 192 coarse points sampled per ray in each iteration. All other hyperparameters are the same as those in GNT [53], including the use of the Adam optimizer with an exponential decay lr scheduler. (Minimal illustrative sketches of the shift reparameterization and this finetuning setup follow the table.)
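
To make the quoted "shift layer" reparameterization concrete, here is a minimal PyTorch sketch of a shift-based linear layer in the spirit of the paper's mixture of multiplication primitives: weights are constrained to signed powers of two, so each multiplication reduces to a bit-shift plus additions on hardware. The module name, parameterization, and straight-through rounding are illustrative assumptions, not the authors' released implementation (see the GitHub repository linked above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftLinear(nn.Module):
    """Illustrative shift-based linear layer: each weight is a signed
    power of two, so w * x can be realized as a bit-shift of x plus a
    sign flip, leaving only additions for the accumulation."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Learnable sign and real-valued shift exponent per weight.
        self.sign = nn.Parameter(torch.ones(out_features, in_features))
        self.exponent = nn.Parameter(torch.zeros(out_features, in_features))

    def forward(self, x):
        # Round the exponent with a straight-through estimator so the
        # effective weights stay powers of two while gradients flow.
        p = self.exponent + (torch.round(self.exponent) - self.exponent).detach()
        w = torch.sign(self.sign) * torch.pow(2.0, p)
        # Emulated here with a dense matmul; dedicated kernels (e.g., TVM)
        # would realize each product as a shift followed by adds.
        return F.linear(x, w)

x = torch.randn(4, 16)
layer = ShiftLinear(16, 8)
print(layer(x).shape)  # torch.Size([4, 8])
```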
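
And a minimal sketch of one finetuning stage as quoted in the Experiment Setup row: AdamW with a cosine-decay learning-rate schedule, 100 epochs per stage, total batch size 256 over eight GPUs. The base learning rate and weight decay shown are placeholders, since the excerpt does not state them for the classification stages.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_stage(model, epochs=100, base_lr=1e-4, weight_decay=0.05):
    """One of the two 100-epoch finetuning stages: AdamW + cosine decay.
    base_lr and weight_decay are assumed values, not taken from the paper."""
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Stage 1: finetune after converting MSA to linear attention and
#          reparameterizing MatMuls with add layers.
# Stage 2: finetune again after reparameterizing MLPs/linear layers
#          with shift or MoE layers.
model = torch.nn.Linear(16, 8)  # stand-in for the ShiftAddViT model
for stage in range(2):
    optimizer, scheduler = build_stage(model)
    for epoch in range(100):
        # ... one epoch over ImageNet with a total batch size of 256 ...
        scheduler.step()
```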