Width & Depth Pruning for Vision Transformers

Authors: Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui (pp. 3143-3151)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of mainstream vision transformers such as DeiT and Swin Transformer with a minor accuracy drop. In particular, on ILSVRC-12, we achieve over 22% pruning ratio of FLOPs by compressing DeiT-Base, even with an increase of 0.14% Top-1 accuracy.
Researcher Affiliation | Collaboration | Fang Yu1,2, Kun Huang3, Meng Wang3, Yuan Cheng3*, Wei Chu3, Li Cui1; 1Institute of Computing Technology, Chinese Academy of Sciences; 2University of Chinese Academy of Sciences; 3Ant Financial Services Group. {yufang,lcui}@ict.ac.cn, {hunterkun.hk,darren.wm,chengyuan.c,weichu.cw}@antgroup.com
Pseudocode | Yes | The detailed procedure is presented in Algorithm 1.
Open Source Code | No | The paper does not contain an explicit statement about making the source code for their methodology available, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Datasets CIFAR-10 contains 50k training images and 10k validating images, which are categorized into 10 classes for image classification. Compared with CIFAR-10, ILSVRC-12 is a larger scale image classification dataset, which comprises 1.28 million images from 1k categories for training and 50k images for validation.
Dataset Splits | Yes | Datasets CIFAR-10 contains 50k training images and 10k validating images, which are categorized into 10 classes for image classification. Compared with CIFAR-10, ILSVRC-12 is a larger scale image classification dataset, which comprises 1.28 million images from 1k categories for training and 50k images for validation.
Hardware Specification | Yes | The GPU throughput is obtained by measuring the forward time on an NVIDIA RTX 3090 GPU with a batch size of 1024, and the latency on CPU is measured on an AMD EPYC 7502 32-Core CPU with a batch size of 1.
Software Dependencies | No | The paper mentions software like "AdamW optimizer", "TensorRT", and "ONNX", but does not provide specific version numbers for any of these dependencies.
Experiment Setup | Yes | The initial learning rate is 0.0005. We use the AdamW optimizer with a momentum of 0.9 for optimization. We set the weight decay to 0.05. [...] The learning rates of saliency scores and threshold parameters are set to 0.025 initially, and they are finetuned with AdamW with a cosine learning rate decay strategy.
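The Hardware Specification row reports throughput as forward time over a fixed batch size. A minimal sketch of that kind of measurement is below; the helper name `measure_throughput` and the warmup/run counts are assumptions for illustration, not the authors' benchmarking code, and on a real GPU the callable would wrap a model forward pass with device synchronization.

```python
import time

def measure_throughput(forward_fn, batch_size, n_warmup=3, n_runs=10):
    """Time repeated forward passes and report images per second.

    `forward_fn` is any zero-argument callable that runs one forward
    pass on a batch; warmup iterations are run first and discarded so
    one-time setup costs do not skew the timing.
    """
    for _ in range(n_warmup):
        forward_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        forward_fn()
    elapsed = time.perf_counter() - start
    return batch_size * n_runs / elapsed
```

With a batch size of 1024 as in the paper's GPU setting, the returned value is images/second; for the CPU latency setting (batch size 1), the reciprocal gives seconds per image.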
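The Experiment Setup row says the saliency-score and threshold learning rates start at 0.025 and follow cosine learning rate decay. A minimal sketch of the standard cosine schedule is below; the function name `cosine_lr` and the choice of a zero floor (`lr_min=0.0`) are assumptions for illustration, since the paper quote does not state a minimum learning rate.

```python
import math

def cosine_lr(step, total_steps, lr_init=0.025, lr_min=0.0):
    """Standard cosine decay: lr_init at step 0, lr_min at total_steps."""
    progress = step / total_steps  # fraction of training completed, in [0, 1]
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * progress))

# The schedule starts at lr_init, reaches half of it at the midpoint,
# and decays smoothly to lr_min at the end of training.
```

In practice this per-step value would be assigned to the optimizer's learning rate each iteration (e.g. via a PyTorch LR scheduler).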