Data-independent Module-aware Pruning for Hierarchical Vision Transformers
Authors: Yang He, Joey Tianyi Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning. |
| Researcher Affiliation | Academia | Yang He, Joey Tianyi Zhou; CFAR, Agency for Science, Technology and Research, Singapore; IHPC, Agency for Science, Technology and Research, Singapore; {He Yang, Joey Zhou}@cfar.a-star.edu.sg |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning. |
| Open Datasets | Yes | We follow previous works Chen et al. (2021c); Yu et al. (2021); Chen et al. (2021b) by validating our method on the ImageNet-1K Russakovsky et al. (2015) benchmark dataset. |
| Dataset Splits | Yes | ImageNet-1K contains 1.28 million training images and 50k validation images of 1,000 classes. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers and data augmentation techniques but does not specify version numbers for any software dependencies or libraries (e.g., 'AdamW optimizer' but no PyTorch/TensorFlow version). |
| Experiment Setup | Yes | We utilize the same training schedules as Liu et al. (2021) when training. Specifically, we employ an AdamW optimizer for 300 epochs using a cosine decay learning rate scheduler and 20 epochs of linear warm-up. The initial learning rate is 0.001, the weight decay is 0.05, and the batch size is 1024. Our data augmentation strategies are also the same as those of Liu et al. (2021) and include color jitter, AutoAugment Cubuk et al. (2018), random erasing Zhong et al. (2020), mixup Zhang et al. (2017), and CutMix Yun et al. (2019). For the fine-tuning process, we use an AdamW optimizer for 30 epochs using a cosine decay learning rate scheduler Loshchilov & Hutter (2017). The base learning rate is 2e-5, and the minimum learning rate is 2e-7. The number of warm-up epochs is 15, and the final learning rate of the linear warm-up process is 2e-8. The weight decay is 1e-8. (See the optimizer/scheduler sketch after this table.) |
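
The quoted experiment setup can be expressed as an optimizer/scheduler configuration. The sketch below is a minimal PyTorch reading of those hyperparameters, not the authors' released code: the placeholder model, the `make_schedule` helper, the warm-up starting learning rates, and the zero minimum learning rate for pre-training are assumptions; the data augmentation pipeline (color jitter, AutoAugment, random erasing, mixup, CutMix) and distributed batching are omitted.

```python
# Sketch of the reported training and fine-tuning schedules: AdamW with
# linear warm-up followed by cosine decay, stepped once per epoch.
# Variable names and the placeholder model are illustrative only.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(768, 1000)  # placeholder for a Swin Transformer


def make_schedule(params, base_lr, weight_decay, warmup_epochs, total_epochs,
                  warmup_start_lr, min_lr):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    optimizer = AdamW(params, lr=base_lr, weight_decay=weight_decay)
    warmup = LinearLR(optimizer,
                      start_factor=warmup_start_lr / base_lr,
                      end_factor=1.0,
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer,
                               T_max=total_epochs - warmup_epochs,
                               eta_min=min_lr)
    scheduler = SequentialLR(optimizer, [warmup, cosine],
                             milestones=[warmup_epochs])
    return optimizer, scheduler


# Training schedule quoted above: 300 epochs, 20 warm-up epochs,
# base lr 1e-3, weight decay 0.05, batch size 1024.
# The warm-up start lr (1e-6) and min lr (0.0) are assumptions.
train_opt, train_sched = make_schedule(model.parameters(),
                                       base_lr=1e-3, weight_decay=0.05,
                                       warmup_epochs=20, total_epochs=300,
                                       warmup_start_lr=1e-6, min_lr=0.0)

# Fine-tuning schedule quoted above: 30 epochs, 15 warm-up epochs,
# base lr 2e-5, min lr 2e-7, weight decay 1e-8. The 2e-8 value is read
# here as the warm-up starting lr (Swin config convention); this is an
# interpretation, not a statement from the paper.
ft_opt, ft_sched = make_schedule(model.parameters(),
                                 base_lr=2e-5, weight_decay=1e-8,
                                 warmup_epochs=15, total_epochs=30,
                                 warmup_start_lr=2e-8, min_lr=2e-7)

for epoch in range(30):
    # ... one epoch of fine-tuning would run here ...
    ft_sched.step()
```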