Data-independent Module-aware Pruning for Hierarchical Vision Transformers
Authors: Yang He, Joey Tianyi Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning. |
| Researcher Affiliation | Academia | Yang He, Joey Tianyi Zhou; CFAR, Agency for Science, Technology and Research, Singapore; IHPC, Agency for Science, Technology and Research, Singapore; {He Yang, Joey Zhou}@cfar.a-star.edu.sg |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning. |
| Open Datasets | Yes | We follow previous works Chen et al. (2021c); Yu et al. (2021); Chen et al. (2021b) by validating our method on the ImageNet-1K Russakovsky et al. (2015) benchmark dataset. |
| Dataset Splits | Yes | ImageNet-1K contains 1.28 million training images and 50k validation images of 1,000 classes. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers and data augmentation techniques but does not specify version numbers for any software dependencies or libraries (e.g., 'AdamW optimizer' but no PyTorch/TensorFlow version). |
| Experiment Setup | Yes | We utilize the same training schedules as Liu et al. (2021) when training. Specifically, we employ an AdamW optimizer for 300 epochs using a cosine decay learning rate scheduler and 20 epochs of linear warm-up. The initial learning rate is 0.001, the weight decay is 0.05, and the batch size is 1024. Our data augmentation strategies are also the same as those of Liu et al. (2021) and include color jitter, AutoAugment Cubuk et al. (2018), random erasing Zhong et al. (2020), mixup Zhang et al. (2017), and CutMix Yun et al. (2019). For the fine-tuning process, we use an AdamW optimizer for 30 epochs using a cosine decay learning rate scheduler Loshchilov & Hutter (2017). The base learning rate is 2e-5, and the minimum learning rate is 2e-7. The number of warm-up epochs is 15, and the final learning rate of the linear warm-up process is 2e-8. The weight decay is 1e-8. (See the optimizer/scheduler sketch after this table.) |
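
The quoted experiment setup can be expressed as an optimizer/scheduler configuration. The sketch below is a minimal PyTorch reading of those hyperparameters, not the authors' released code: the placeholder model, the `make_schedule` helper, the warm-up starting learning rates, and the zero minimum learning rate for pre-training are assumptions; the data augmentation pipeline (color jitter, AutoAugment, random erasing, mixup, CutMix) and distributed batching are omitted.

```python
# Sketch of the reported training and fine-tuning schedules: AdamW with
# linear warm-up followed by cosine decay, stepped once per epoch.
# Variable names and the placeholder model are illustrative only.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(768, 1000)  # placeholder for a Swin Transformer


def make_schedule(params, base_lr, weight_decay, warmup_epochs, total_epochs,
                  warmup_start_lr, min_lr):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    optimizer = AdamW(params, lr=base_lr, weight_decay=weight_decay)
    warmup = LinearLR(optimizer,
                      start_factor=warmup_start_lr / base_lr,
                      end_factor=1.0,
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer,
                               T_max=total_epochs - warmup_epochs,
                               eta_min=min_lr)
    scheduler = SequentialLR(optimizer, [warmup, cosine],
                             milestones=[warmup_epochs])
    return optimizer, scheduler


# Training schedule quoted above: 300 epochs, 20 warm-up epochs,
# base lr 1e-3, weight decay 0.05, batch size 1024.
# The warm-up start lr (1e-6) and min lr (0.0) are assumptions.
train_opt, train_sched = make_schedule(model.parameters(),
                                       base_lr=1e-3, weight_decay=0.05,
                                       warmup_epochs=20, total_epochs=300,
                                       warmup_start_lr=1e-6, min_lr=0.0)

# Fine-tuning schedule quoted above: 30 epochs, 15 warm-up epochs,
# base lr 2e-5, min lr 2e-7, weight decay 1e-8. The 2e-8 value is read
# here as the warm-up starting lr (Swin config convention); this is an
# interpretation, not a statement from the paper.
ft_opt, ft_sched = make_schedule(model.parameters(),
                                 base_lr=2e-5, weight_decay=1e-8,
                                 warmup_epochs=15, total_epochs=30,
                                 warmup_start_lr=2e-8, min_lr=2e-7)

for epoch in range(30):
    # ... one epoch of fine-tuning would run here ...
    ft_sched.step()
```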