Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Authors: Xiangcheng Liu, Tianyi Wu, Guodong Guo

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy, which achieves a better trade-off between accuracy and latency than the previous methods.
Researcher Affiliation | Collaboration | Xiangcheng Liu (Peking University), Tianyi Wu* (Baidu Autonomous Driving Technology Department, ADT), Guodong Guo (Institute of Deep Learning, Baidu Research). Emails: liuxiangcheng@stu.pku.edu.cn, wutianyi01@baidu.com, Guodong.Guo@mail.wvu.edu
Pseudocode | No | The paper describes the methods using text and mathematical equations, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about, or link to, open-source code.
Open Datasets | Yes | Our experiments are conducted on the ImageNet-1K [Deng et al., 2009] classification dataset.
Dataset Splits | No | The paper states: "We finetune the pre-trained model by 30 epoch to obtain the compressed network, and most of the training settings stay the same as the originals." and mentions "ImageNet validation dataset images", but it does not provide specific split percentages or sample counts for the training, validation, or test sets.
Hardware Specification | Yes | The throughput metric is measured on a single NVIDIA 2080Ti GPU using a fixed batch size of 64, and hardware latency is the average elapsed time of 100 single-image inferences on the same machine (a measurement sketch in PyTorch follows the table).
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We finetune the pre-trained model for 30 epochs to obtain the compressed network, and most of the training settings stay the same as the originals. The overall training objective is a combination of the above three components: $L = L_{CE} + \lambda_1 L_{FLOPs} + \lambda_2 L_{distill}$, where $\lambda_1$ and $\lambda_2$ control the loss balance; we set $\lambda_1 = 2$, $\lambda_2 = 0.5$ in our experiments (a sketch of this combined objective also follows the table).
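
The hardware protocol in the table is concrete enough to re-create. Below is a minimal PyTorch sketch of that measurement, assuming a stock DeiT-S from the timm library as a stand-in for the compressed model (which is not released). The batch size of 64 and the 100-inference latency average follow the Hardware Specification row; the warm-up and throughput run counts are assumptions.

```python
import time
import torch
import timm  # assumed dependency; the paper does not name its software stack

device = torch.device("cuda")
# DeiT-S stands in for the compressed network, whose code is not released.
model = timm.create_model("deit_small_patch16_224", pretrained=True).to(device).eval()

@torch.no_grad()
def throughput(batch_size=64, runs=30, warmup=10):
    """Images per second at a fixed batch size, as in the paper's protocol."""
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return batch_size * runs / (time.time() - start)

@torch.no_grad()
def latency(runs=100, warmup=10):
    """Average elapsed time (ms) over `runs` single-image inferences."""
    x = torch.randn(1, 3, 224, 224, device=device)
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000

print(f"throughput: {throughput():.1f} img/s, latency: {latency():.2f} ms")
```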
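
Similarly, the Experiment Setup row fully specifies the loss weights, so the overall objective can be sketched. Only the combination $L = L_{CE} + \lambda_1 L_{FLOPs} + \lambda_2 L_{distill}$ with $\lambda_1 = 2$, $\lambda_2 = 0.5$ comes from the paper; the FLOPs and distillation terms below are hypothetical stand-ins for the paper's own formulations, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

LAMBDA_FLOPS, LAMBDA_DISTILL = 2.0, 0.5  # λ1 and λ2 from the paper

def total_loss(logits, targets, teacher_logits, flops_ratio, target_ratio):
    """L = L_CE + λ1 * L_FLOPs + λ2 * L_distill, with weights from the paper.

    `flops_ratio`, `target_ratio`, and the exact forms of the FLOPs and
    distillation terms are assumed stand-ins, not the paper's definitions.
    """
    l_ce = F.cross_entropy(logits, targets)
    # FLOPs regularizer: penalize deviation of the pruned model's FLOPs
    # from a target budget (a common formulation, assumed here).
    l_flops = (flops_ratio - target_ratio) ** 2
    # Distillation from the uncompressed model's soft predictions.
    l_distill = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return l_ce + LAMBDA_FLOPS * l_flops + LAMBDA_DISTILL * l_distill
```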