Width & Depth Pruning for Vision Transformers
Authors: Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui (pp. 3143-3151)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of mainstream vision transformers such as DeiT and Swin Transformer with a minor accuracy drop. In particular, on ILSVRC-12, we achieve over 22% pruning ratio of FLOPs by compressing DeiT-Base, even with an increase of 0.14% Top-1 accuracy. |
| Researcher Affiliation | Collaboration | Fang Yu1,2, Kun Huang3, Meng Wang3, Yuan Cheng3*, Wei Chu3, Li Cui1 1Institute of Computing Technology, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3Ant Financial Services Group {yufang,lcui}@ict.ac.cn, {hunterkun.hk,darren.wm,chengyuan.c,weichu.cw}@antgroup.com |
| Pseudocode | Yes | The detailed procedure is presented in Algorithm 1. |
| Open Source Code | No | The paper does not contain an explicit statement about making the source code for their methodology available, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Datasets CIFAR-10 contains 50k training images and 10k validating images, which are categorized into 10 classes for image classification. Compared with CIFAR-10, ILSVRC-12 is a larger scale image classification dataset, which comprises 1.28 million images from 1k categories for training and 50k images for validation. |
| Dataset Splits | Yes | Datasets CIFAR-10 contains 50k training images and 10k validating images, which are categorized into 10 classes for image classification. Compared with CIFAR-10, ILSVRC-12 is a larger scale image classification dataset, which comprises 1.28 million images from 1k categories for training and 50k images for validation. |
| Hardware Specification | Yes | The GPU throughput is obtained by measuring the forward time on an NVIDIA RTX 3090 GPU with a batch size of 1024, and the latency on CPU is measured on an AMD EPYC 7502 32-Core CPU with a batch size of 1. |
| Software Dependencies | No | The paper mentions software like "AdamW optimizer", "TensorRT", and "ONNX", but does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | The initial learning rate is 0.0005. We use the AdamW optimizer with a momentum of 0.9 for optimization. We set the weight decay to 0.05. [...] The learning rates of saliency scores and threshold parameters are set to 0.025 initially, and they are finetuned with AdamW with a cosine learning rate decay strategy. |
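The cosine learning-rate decay mentioned in the experiment setup can be sketched as follows. This is a minimal illustrative helper, not code from the paper; the function name and the `min_lr` floor are assumptions, with `base_lr=0.025` taken from the quoted setup for the saliency scores:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.025, min_lr=0.0):
    """Standard cosine decay from base_lr down to min_lr over total_steps.

    Hypothetical helper illustrating the schedule described in the paper;
    base_lr=0.025 matches the quoted initial rate for saliency scores.
    """
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The rate starts at base_lr, reaches half of it at the midpoint,
# and decays to min_lr at the final step.
print(cosine_lr(0, 100))    # 0.025
print(cosine_lr(50, 100))   # 0.0125
print(cosine_lr(100, 100))  # 0.0
```

In practice this schedule is usually supplied by the training framework (e.g. a cosine-annealing scheduler) rather than implemented by hand.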