Effective Model Sparsification by Scheduled Grow-and-Prune Methods

Authors: Xiaolong Ma, Minghai Qin, Fei Sun, Zejiang Hou, Kun Yuan, Yi Xu, Yanzhi Wang, Yen-Kuang Chen, Rong Jin, Yuan Xie

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that the models pruned using the proposed methods match or beat the quality of the highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification.
Researcher Affiliation | Collaboration | 1 Northeastern University; 2 DAMO Academy, Alibaba Group; 3 Princeton University; 4 Dalian University of Technology
Pseudocode | Yes | Algorithm 1: C-GaP training flow. Input: An L-layer model with uninitialized weight Θ; pruning ratio r. Output: An L-layer sparse model satisfying the target sparsity requirement. Algorithm 2: P-GaP training flow. (A hedged sketch of the grow-and-prune flow is given after the table.)
Open Source Code | Yes | Code available at: https://github.com/boone891214/GaP.
Open Datasets | Yes | ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, 2009. ... COCO-2017 ... ShapeNet (Yi et al., 2016) ... WMT-14 En-De dataset
Dataset Splits | Yes | The learning rate is scheduled with a linear warm-up for 2 epochs before reaching the initial learning rate of 2.048. Each GaP step with non-uniform and uniform sparsity includes 30 of training, respectively. The final fine-tuning includes 150 epochs. For C-GaP, we train for 28 steps (i.e., 7 rounds with 4 partitions), and for P-GaP, we train for 32 steps (i.e., 8 rounds with 4 partitions). After the GaP step, we prune the dense partition(s) and fine-tune the model. For the prune-from-dense method, we use iterative ADMM pruning. We first pretrain a dense model for 250 epochs and perform ADMM regularization training for 250 epochs. Then we prune the model to 80% sparsity and fine-tune for another 250 epochs. Thus, the total number of epochs for 80% sparsity is 750. (A sketch of the ADMM pruning loop is given after the table.)
Hardware Specification | Yes | All experiments are trained and evaluated with PyTorch on machines with 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number. It also references 'NVIDIA (2020a)' for training scripts and hyperparameters, implying specific software, but without version details.
Experiment Setup | Yes | In the image classification task, we use standard data augmentation, a batch size of 2048, a cosine annealing learning rate schedule, an SGD optimizer with a momentum of 0.875, and a weight decay of 3.05e-5. The learning rate is scheduled with a linear warm-up for 2 epochs before reaching the initial learning rate of 2.048. Each GaP step with non-uniform and uniform sparsity includes 30 of training, respectively. The final fine-tuning includes 150 epochs. (A sketch of this optimizer and schedule is given after the table.)
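The grow-and-prune (GaP) flow named in the Pseudocode row divides the layers into partitions and, at each step, grows one partition back to dense while keeping the others sparse, trains, and then prunes the grown partition again. Below is a minimal Python/PyTorch sketch of that idea under our own assumptions: make_mask, apply_masks, train_fn, and the simple round-robin schedule are illustrative placeholders, not the authors' Algorithm 1 or 2.

```python
# Minimal sketch of a C-GaP-style training loop, assuming magnitude-based
# pruning and a round-robin schedule over layer partitions. All helper names
# are illustrative placeholders, not taken from the authors' released code.
import torch


def make_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that keeps the largest-magnitude (1 - sparsity) fraction."""
    k = int(weight.numel() * sparsity)          # number of weights to prune
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


def apply_masks(layers, masks):
    """Zero out pruned weights in every masked layer (None means dense)."""
    for layer, mask in zip(layers, masks):
        if mask is not None:
            layer.weight.data.mul_(mask)


def c_gap(layers, partitions, sparsity, num_steps, train_fn):
    """
    layers:     list of nn.Module layers, each with a .weight parameter.
    partitions: list of index lists that split `layers` into groups.
    train_fn:   callable that trains the (partially masked) model for one GaP step.
    """
    masks = [make_mask(l.weight, sparsity) for l in layers]   # start all-sparse
    for step in range(num_steps):
        grown = partitions[step % len(partitions)]
        for idx in grown:                      # grow: drop the mask on this partition
            masks[idx] = None
        apply_masks(layers, masks)
        train_fn()                             # train with the grown partition dense
        for idx in grown:                      # prune the grown partition back
            masks[idx] = make_mask(layers[idx].weight, sparsity)
    apply_masks(layers, masks)                 # final sparse model; fine-tune next
    return masks
```

In a real run, train_fn would also re-apply the masks after every optimizer step so that pruned weights stay at zero between GaP steps.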
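The prune-from-dense baseline in the Dataset Splits row uses iterative ADMM pruning. The sketch below follows the standard ADMM weight-pruning recipe (alternating a regularized training phase, a projection onto the sparsity constraint, and a dual update); project_to_sparsity, train_epoch_fn, and admm_epochs are our placeholders, and the baseline's exact per-layer settings may differ.

```python
# Hedged sketch of an iterative ADMM pruning loop in the standard style;
# the penalty weight rho lives inside train_epoch_fn's loss term.
import torch


def project_to_sparsity(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Euclidean projection onto the sparse set: keep the largest-magnitude
    (1 - sparsity) fraction of entries, zero out the rest."""
    k = int(w.numel() * (1.0 - sparsity))
    out = torch.zeros_like(w)
    if k > 0:
        _, idx = w.abs().flatten().topk(k)
        out.view(-1)[idx] = w.view(-1)[idx]
    return out


def admm_prune(weights, sparsity, train_epoch_fn, admm_epochs):
    """
    weights:        list of weight tensors (e.g. nn.Parameter) to prune.
    train_epoch_fn: trains the model for one epoch on
                    loss + (rho / 2) * sum_i ||W_i - Z_i + U_i||^2, given Z and U.
    """
    Z = [project_to_sparsity(w.detach(), sparsity) for w in weights]
    U = [torch.zeros_like(w) for w in weights]
    for _ in range(admm_epochs):
        train_epoch_fn(Z, U)                          # W-update: regularized training
        for i, w in enumerate(weights):
            Z[i] = project_to_sparsity(w.detach() + U[i], sparsity)  # Z-update
            U[i] = U[i] + w.detach() - Z[i]                          # dual (U) update
    # Hard-prune by magnitude and return binary masks for the fine-tuning stage.
    return [(project_to_sparsity(w.detach(), sparsity) != 0).float() for w in weights]
```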
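The Experiment Setup row fixes the image-classification optimizer: SGD with momentum 0.875, weight decay 3.05e-5, batch size 2048, a base learning rate of 2.048, a 2-epoch linear warm-up, and cosine annealing. A minimal PyTorch sketch of that schedule follows; the ResNet-50 model, epoch budget, and steps_per_epoch value are illustrative assumptions, not the authors' training script.

```python
# Hedged sketch of the stated optimizer and warm-up + cosine learning-rate schedule.
import math
import torch
import torchvision

model = torchvision.models.resnet50()      # placeholder architecture
total_epochs, warmup_epochs = 150, 2       # fine-tuning budget and warm-up from the text
steps_per_epoch = 625                      # assumption: ~1.28M ImageNet images / batch 2048

optimizer = torch.optim.SGD(
    model.parameters(), lr=2.048, momentum=0.875, weight_decay=3.05e-5
)

def warmup_cosine(step: int) -> float:
    """Multiplier on the base LR: linear warm-up for 2 epochs, then cosine decay to 0."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# In the training loop, call scheduler.step() after each optimizer.step().
```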