EViT: Expediting Vision Transformers via Token Reorganizations
Authors: Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the standard benchmarks show the effectiveness of our method. The experimental results show our advantages. We report the main results of EViT in Tables 2 and 3. |
| Researcher Affiliation | Collaboration | UC San Diego, The University of Hong Kong, Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudocode of EViT for a ViT encoder. (A sketch of the token-selection step appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/youweiliang/evit |
| Open Datasets | Yes | We train all of the models on the ImageNet (Deng et al., 2009) training set with approximately 1.2 million images and report the accuracy on the 50k images in the test set. |
| Dataset Splits | Yes | We train all of the models on the ImageNet (Deng et al., 2009) training set with approximately 1.2 million images and report the accuracy on the 50k images in the test set. |
| Hardware Specification | Yes | We train the models with EViT from scratch for 300 epochs on 16 NVIDIA A100 GPUs and measure the throughput of the models on a single A100 GPU with a batch size of 128 unless otherwise specified. |
| Software Dependencies | Yes | The multiply-accumulate computations (MACs) metric is measured by torchprofile (Liu, 2021). |
| Experiment Setup | Yes | We train the models with EViT from scratch for 300 epochs on 16 NVIDIA A100 GPUs and measure the throughput of the models on a single A100 GPU with a batch size of 128 unless otherwise specified. For the training strategies and optimization methods, we simply follow those in the original papers of DeiT (Touvron et al., 2021a) and LV-ViT (Jiang et al., 2021). By default, the token identification module is incorporated into the 4th, 7th and 10th layers of DeiT-S and DeiT-B (with 12 layers in total) and into the 5th, 9th and 13th layers of LV-ViT-S (with 16 layers in total). Besides, we adopt a warmup strategy for attentive token identification. Specifically, the keep rate of attentive tokens is gradually reduced from 1 to the target value with a cosine schedule. (A sketch of this schedule appears after the table.) |
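The Pseudocode row refers to the paper's Algorithm 1, which reorganizes tokens by how strongly the [CLS] token attends to them. Below is a minimal PyTorch sketch of that attentive token identification step, assuming the head-averaged [CLS] attention over the image tokens is already available from the preceding self-attention layer. The function and variable names (e.g. `select_attentive_tokens`, `cls_attn`) are illustrative assumptions, not the authors' released code; see the linked repository for the real implementation.

```python
import torch


def select_attentive_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_rate: float):
    """Keep the top-k image tokens ranked by [CLS] attention and fuse the rest.

    x        : (B, N, C) token embeddings, with x[:, 0] the [CLS] token
    cls_attn : (B, N-1) head-averaged attention of [CLS] to the N-1 image tokens
    keep_rate: fraction of image tokens to keep (0 < keep_rate <= 1)
    """
    B, N, C = x.shape
    num_keep = max(1, int(keep_rate * (N - 1)))
    if num_keep >= N - 1:  # keep_rate == 1: nothing to drop or fuse
        return x

    # Rank image tokens by how much the [CLS] token attends to them.
    _, topk_idx = cls_attn.topk(num_keep, dim=1)                       # (B, K)
    keep_idx = topk_idx.unsqueeze(-1).expand(-1, -1, C)                # (B, K, C)
    attentive = torch.gather(x[:, 1:], dim=1, index=keep_idx)          # (B, K, C)

    # Fuse the remaining (inattentive) tokens into one token, weighted by attention.
    mask = torch.ones_like(cls_attn, dtype=torch.bool)
    mask.scatter_(1, topk_idx, False)
    rest = x[:, 1:][mask].view(B, N - 1 - num_keep, C)
    rest_attn = cls_attn[mask].view(B, N - 1 - num_keep, 1)
    fused = (rest * rest_attn).sum(dim=1, keepdim=True) / \
        rest_attn.sum(dim=1, keepdim=True).clamp(min=1e-6)

    # New, shorter sequence: [CLS] + attentive tokens + one fused token.
    return torch.cat([x[:, :1], attentive, fused], dim=1)
```

The cosine warmup of the keep rate described in the Experiment Setup row can be expressed as a small schedule function. The sketch below is an assumption-laden illustration: the warmup length (100 epochs here) and the name `keep_rate_at_epoch` are placeholders, not values taken from the paper; only the shape of the schedule (decaying from 1.0 to the target keep rate along a cosine curve) follows the quoted description.

```python
import math


def keep_rate_at_epoch(epoch: int, target_keep_rate: float,
                       warmup_epochs: int = 100) -> float:
    """Cosine decay of the keep rate from 1.0 to the target over the warmup period."""
    if epoch >= warmup_epochs:
        return target_keep_rate
    progress = epoch / warmup_epochs                     # 0 -> 1 over the warmup
    return target_keep_rate + 0.5 * (1.0 - target_keep_rate) * (1.0 + math.cos(math.pi * progress))
```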
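At epoch 0 this returns 1.0 (all tokens kept) and at the end of the warmup it returns the target keep rate, after which the rate stays fixed for the remainder of the 300-epoch training run.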