ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Authors: Haoran You, Huihong Shi, Yipin Guo, Yingyan Lin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18× latency reductions on GPUs and 42.9% energy savings, while maintaining a comparable accuracy as original or efficient ViTs. |
| Researcher Affiliation | Academia | Haoran You*, Huihong Shi*, Yipin Guo*, and Yingyan (Celine) Lin, Georgia Institute of Technology, Atlanta, GA. {haoran.you, celine.lin}@gatech.edu, eic-lab@groups.gatech.edu |
| Pseudocode | No | The paper uses figures (e.g., Figure 1, Figure 2) to illustrate the network architecture and operations but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at https://github.com/GATECH-EIC/ShiftAddViT. |
| Open Datasets | Yes | We consider two representative 2D and 3D Transformer-based vision tasks to demonstrate the superiority of the proposed ShiftAddViT, including 2D image classification on the ImageNet dataset [15] with 1.2 million training and 50K validation images and the 3D novel view synthesis (NVS) task on the Local Light Field Fusion (LLFF) dataset [40] with eight scenes. |
| Dataset Splits | Yes | including 2D image classification on the ImageNet dataset [15] with 1.2 million training and 50K validation images |
| Hardware Specification | Yes | All experiments are run on a server with eight RTX A5000 GPUs with each having 24GB GPU memory. |
| Software Dependencies | No | The paper mentions PyTorch [42] and TVM [10] as software used for implementation and optimization, but specific version numbers for these or other software dependencies are not provided. |
| Experiment Setup | Yes | For the classification task, we follow Ecoformer [34] to initialize the pre-trained ViTs with Multi-head Self-Attention (MSA) weights, based on which we apply our reparameterization via a two-stage finetuning: (1) convert MSA to linear attention [73] and reparameterize all MatMuls with add layers with 100 epochs of finetuning, and (2) reparameterize MLPs or linear layers with shift or MoE layers by finetuning for another 100 epochs. ... All experiments are run on a server with eight RTX A5000 GPUs with a total batch size of 256, and we use the AdamW optimizer [37] with a cosine decay lr scheduler. ... For the NVS task, we still follow the two-stage finetuning but do not convert MSA weights to linear attention to maintain the accuracy. ... Both stages are finetuned for 140K steps with a base lr of 5×10⁻⁴, and we sample 2048 rays with 192 coarse points sampled per ray in each iteration. All other hyperparameters are the same as those in GNT [53], including the use of the Adam optimizer with an exponential decay lr scheduler. |
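
The two-stage finetuning recipe quoted above maps onto a fairly standard PyTorch training loop. Below is a minimal sketch of one finetuning stage under the reported settings (AdamW with a cosine-decay lr schedule, total batch size 256, 100 epochs per stage). The reparameterization helpers (`convert_msa_to_linear_attention`, `reparameterize_matmuls_to_add`, `reparameterize_mlps_to_shift_or_moe`) and the classification-stage base learning rate are not given in the excerpt, so they appear here as hypothetical placeholders rather than the authors' released API.

```python
# Sketch of one ShiftAddViT finetuning stage: AdamW + cosine-decay lr schedule,
# per the experiment-setup excerpt. Hypothetical placeholders are marked below.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS_PER_STAGE = 100   # each reparameterization stage is finetuned for 100 epochs
BATCH_SIZE = 256         # total batch size across eight RTX A5000 GPUs (paper setting)


def finetune(model, train_loader, base_lr=1e-4, epochs=EPOCHS_PER_STAGE):
    """Finetune one reparameterization stage with AdamW and cosine lr decay.

    `base_lr` is illustrative only; the classification lr is not stated in the excerpt.
    """
    optimizer = AdamW(model.parameters(), lr=base_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()  # cosine decay stepped once per epoch
    return model


# Stage 1 (hypothetical helpers): convert MSA to linear attention and
# reparameterize MatMuls into add layers, then finetune for 100 epochs.
# model = reparameterize_matmuls_to_add(convert_msa_to_linear_attention(pretrained_vit))
# model = finetune(model, train_loader)
#
# Stage 2: reparameterize MLP/linear layers into shift or MoE layers, finetune again.
# model = reparameterize_mlps_to_shift_or_moe(model)
# model = finetune(model, train_loader)
```

In the actual release (https://github.com/GATECH-EIC/ShiftAddViT) the conversion and finetuning are driven by the authors' own scripts; the sketch only illustrates the optimizer/scheduler pairing and two-stage flow described in the setup.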