Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Authors: Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, Xing Sun

AAAI 2022, pp. 2964-2972 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrated that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% throughput while only sacrificing 0.4% top-1 accuracy on ImageNet-1K, outperforming current token pruning methods on both accuracy and efficiency.
Researcher Affiliation | Collaboration | Yifan Xu (1,3,4)*, Zhijie Zhang (2,3), Mengdan Zhang (3), Kekai Sheng (3), Ke Li (3), Weiming Dong (1,4), Liqing Zhang (2), Changsheng Xu (1,4), Xing Sun (3). Affiliations: 1 NLPR, Institute of Automation, Chinese Academy of Sciences; 2 Shanghai Jiao Tong University; 3 Tencent Youtu Lab; 4 School of Artificial Intelligence, University of Chinese Academy of Sciences.
Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code is available at https://github.com/YifanXu74/Evo-ViT.
Open Datasets | Yes | We demonstrate the superiority of the proposed Evo-ViT approach through extensive experiments on the ImageNet-1k (Deng et al. 2009) classification dataset.
Dataset Splits | Yes | We demonstrate the superiority of the proposed Evo-ViT approach through extensive experiments on the ImageNet-1k (Deng et al. 2009) classification dataset. For fair comparisons, all the models are trained for 300 epochs. The standard ImageNet train/validation splits are implicitly used, given the reference to this well-established benchmark dataset and the comparative nature of the experiments.
Hardware Specification | Yes | The throughput is measured on a single NVIDIA V100 GPU with batch size fixed to 256, which is the same as the setting of DeiT.
Software Dependencies | No | The paper refers to various models and frameworks (e.g., DeiT, LeViT) but does not provide specific software dependencies (e.g., programming language versions, or library names with version numbers such as PyTorch 1.9 or CUDA 11.1).
Experiment Setup | Yes | For overall comparisons with the state-of-the-art methods..., we conduct the token selection and slow-fast token updating from the fifth layer of DeiT and the third layer (excluding the convolution layers) of LeViT, respectively. The selection ratios of informative tokens in all selected layers of both DeiT and LeViT are set to 0.5. The global CLS attention trade-off $\alpha$ in Eqn. 4 is set to 0.5 for all layers. For fair comparisons, all the models are trained for 300 epochs. Specifically, we conduct the token selection and slow-fast token updating layer by layer for the first 200 training epochs. During the remaining 100 epochs, we only conduct token selection at the beginning of each stage, and slow-fast updating is then performed normally in each layer. For transformers with a flat structure such as DeiT, we manually arrange four layers as one stage. Assisted CLS token loss: we calculate classification losses based on the CLS token together with the final average-pooled features during training. Mathematically, $\hat{y}_{cls}, \hat{y} = m(x_{cls}, x_{patch})$ and $\mathcal{L} = \phi(\hat{y}_{cls}, y) + \phi(\mathrm{Avg}(\hat{y}), y)$, where $y$ is the ground truth of $x_{cls}$ and $x_{patch}$; $m$ denotes the transformer model; $\phi$ is the classification metric function, usually realized by the cross-entropy loss.
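To make the reported setup concrete, below is a minimal PyTorch-style sketch of the assisted CLS token loss and of selecting informative tokens with the 0.5 keep ratio. The function names, tensor shapes, and the simple top-k ranking by CLS attention score are illustrative assumptions, not the authors' implementation; in particular, the global CLS attention update of Eqn. 4 is not reproduced here.

```python
import torch
import torch.nn.functional as F

def assisted_cls_loss(y_hat_cls, y_hat_patch, target):
    # Sketch of L = phi(y_hat_cls, y) + phi(Avg(y_hat), y), with phi the cross-entropy loss.
    # y_hat_cls:   logits predicted from the CLS token, shape (B, num_classes)
    # y_hat_patch: per-patch logits,                    shape (B, N, num_classes)
    # target:      ground-truth labels,                 shape (B,)
    loss_cls = F.cross_entropy(y_hat_cls, target)
    loss_avg = F.cross_entropy(y_hat_patch.mean(dim=1), target)  # average-pooled patch predictions
    return loss_cls + loss_avg

def select_informative_tokens(cls_attn, keep_ratio=0.5):
    # Hypothetical selection step: rank patch tokens by their CLS attention score and
    # keep the top `keep_ratio` as informative (slow) tokens; the remaining tokens
    # would follow the fast updating path. cls_attn has shape (B, N).
    num_keep = max(1, int(cls_attn.shape[1] * keep_ratio))
    keep_idx = cls_attn.topk(num_keep, dim=1).indices
    return keep_idx
```

In this sketch the same cross-entropy criterion is applied to both the CLS head and the average-pooled patch predictions, matching the quoted formula; the released repository may organize these steps differently.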