EViT: Expediting Vision Transformers via Token Reorganizations
Authors: Youwei Liang, Chongjian GE, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the standard benchmarks show the effectiveness of our method. The experimental results show our advantages. We report the main results of EViT in Tables 2 and 3. |
| Researcher Affiliation | Collaboration | UC San Diego, The University of Hong Kong, Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudocode of EViT for a ViT encoder. (A sketch of the token-selection step appears after this table.) |
| Open Source Code | Yes | The code is available at https://github.com/youweiliang/evit |
| Open Datasets | Yes | We train all of the models on the ImageNet (Deng et al., 2009) training set with approximately 1.2 million images and report the accuracy on the 50k images in the test set. |
| Dataset Splits | Yes | We train all of the models on the ImageNet (Deng et al., 2009) training set with approximately 1.2 million images and report the accuracy on the 50k images in the test set. |
| Hardware Specification | Yes | We train the models with EViT from scratch for 300 epochs on 16 NVIDIA A100 GPUs and measure the throughput of the models on a single A100 GPU with a batch size of 128 unless otherwise specified. |
| Software Dependencies | Yes | The multiply-accumulate computations (MACs) metric is measured by torchprofile (Liu, 2021). |
| Experiment Setup | Yes | We train the models with EViT from scratch for 300 epochs on 16 NVIDIA A100 GPUs and measure the throughput of the models on a single A100 GPU with a batch size of 128 unless otherwise specified. For the training strategies and optimization methods, we simply follow those in the original papers of DeiT (Touvron et al., 2021a) and LV-ViT (Jiang et al., 2021). By default, the token identification module is incorporated into the 4th, 7th and 10th layers of DeiT-S and DeiT-B (with 12 layers in total) and into the 5th, 9th and 13th layers of LV-ViT-S (with 16 layers in total). Besides, we adopt a warmup strategy for attentive token identification. Specifically, the keep rate of attentive tokens is gradually reduced from 1 to the target value with a cosine schedule. (A sketch of this schedule appears after the table.) |
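The Pseudocode row refers to the paper's Algorithm 1, which reorganizes tokens by how strongly the [CLS] token attends to them. Below is a minimal PyTorch sketch of that attentive token identification step, assuming the head-averaged [CLS] attention over the image tokens is already available from the preceding self-attention layer. The function and variable names (e.g. `select_attentive_tokens`, `cls_attn`) are illustrative assumptions, not the authors' released code; see the linked repository for the real implementation.

```python
import torch


def select_attentive_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_rate: float):
    """Keep the top-k image tokens ranked by [CLS] attention and fuse the rest.

    x        : (B, N, C) token embeddings, with x[:, 0] the [CLS] token
    cls_attn : (B, N-1) head-averaged attention of [CLS] to the N-1 image tokens
    keep_rate: fraction of image tokens to keep (0 < keep_rate <= 1)
    """
    B, N, C = x.shape
    num_keep = max(1, int(keep_rate * (N - 1)))
    if num_keep >= N - 1:  # keep_rate == 1: nothing to drop or fuse
        return x

    # Rank image tokens by how much the [CLS] token attends to them.
    _, topk_idx = cls_attn.topk(num_keep, dim=1)                       # (B, K)
    keep_idx = topk_idx.unsqueeze(-1).expand(-1, -1, C)                # (B, K, C)
    attentive = torch.gather(x[:, 1:], dim=1, index=keep_idx)          # (B, K, C)

    # Fuse the remaining (inattentive) tokens into one token, weighted by attention.
    mask = torch.ones_like(cls_attn, dtype=torch.bool)
    mask.scatter_(1, topk_idx, False)
    rest = x[:, 1:][mask].view(B, N - 1 - num_keep, C)
    rest_attn = cls_attn[mask].view(B, N - 1 - num_keep, 1)
    fused = (rest * rest_attn).sum(dim=1, keepdim=True) / \
        rest_attn.sum(dim=1, keepdim=True).clamp(min=1e-6)

    # New, shorter sequence: [CLS] + attentive tokens + one fused token.
    return torch.cat([x[:, :1], attentive, fused], dim=1)
```

The cosine warmup of the keep rate described in the Experiment Setup row can be expressed as a small schedule function. The sketch below is an assumption-laden illustration: the warmup length (100 epochs here) and the name `keep_rate_at_epoch` are placeholders, not values taken from the paper; only the shape of the schedule (decaying from 1.0 to the target keep rate along a cosine curve) follows the quoted description.

```python
import math


def keep_rate_at_epoch(epoch: int, target_keep_rate: float,
                       warmup_epochs: int = 100) -> float:
    """Cosine decay of the keep rate from 1.0 to the target over the warmup period."""
    if epoch >= warmup_epochs:
        return target_keep_rate
    progress = epoch / warmup_epochs                     # 0 -> 1 over the warmup
    return target_keep_rate + 0.5 * (1.0 - target_keep_rate) * (1.0 + math.cos(math.pi * progress))
```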
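At epoch 0 this returns 1.0 (all tokens kept) and at the end of the warmup it returns the target keep rate, after which the rate stays fixed for the remainder of the 300-epoch training run.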