Effective Model Sparsification by Scheduled Grow-and-Prune Methods
Authors: Xiaolong Ma, Minghai Qin, Fei Sun, Zejiang Hou, Kun Yuan, Yi Xu, Yanzhi Wang, Yen-Kuang Chen, Rong Jin, Yuan Xie
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that the models pruned using the proposed methods match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. |
| Researcher Affiliation | Collaboration | Northeastern University; DAMO Academy, Alibaba Group; Princeton University; Dalian University of Technology |
| Pseudocode | Yes | Algorithm 1: C-GaP training flow. Input: an L-layer model with uninitialized weights Θ; pruning ratio r. Output: an L-layer sparse model satisfying the target sparsity requirement. Algorithm 2: P-GaP training flow. A hedged sketch of the grow-and-prune loop appears after this table. |
| Open Source Code | Yes | Code available at: https://github.com/boone891214/GaP. |
| Open Datasets | Yes | ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, 2009. ... COCO-2017 ... ShapeNet (Yi et al., 2016) ... WMT-14 En-De dataset |
| Dataset Splits | Yes | The learning rate is scheduled with a linear warm-up for 2 epochs before reaching the initial learning rate of 2.048. Each GaP step with non-uniform and uniform sparsity includes 30 epochs of training. The final fine-tuning includes 150 epochs. For C-GaP, we train for 28 steps (i.e., 7 rounds with 4 partitions), and for P-GaP, we train for 32 steps (i.e., 8 rounds with 4 partitions). After each GaP step, we prune the dense partition(s) and fine-tune the model. For the prune-from-dense method, we use iterative ADMM pruning: we first pretrain a dense model for 250 epochs, perform ADMM regularization training for 250 epochs, then prune the model to 80% sparsity and fine-tune for another 250 epochs. Thus, the total number of epochs for 80% sparsity is 750. These schedules are composed in the arithmetic check after the table. |
| Hardware Specification | Yes | All of our experiments are trained and run for inference using PyTorch on machines with 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify a version number. It also references 'NVIDIA (2020a)' for training scripts and hyperparameters, implying specific software, but without version details. |
| Experiment Setup | Yes | In the image classification task, we use standard data augmentation, a batch size of 2048, a cosine annealing learning rate schedule, an SGD optimizer with a momentum of 0.875, and a weight decay of 3.05e-5. The learning rate is scheduled with a linear warm-up for 2 epochs before reaching the initial learning rate of 2.048. Each GaP step with non-uniform and uniform sparsity includes 30 epochs of training. The final fine-tuning includes 150 epochs. A hedged sketch of this optimizer and scheduler configuration follows the table. |
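The pseudocode row above (Algorithm 1, C-GaP) describes a cyclic grow-and-prune loop over model partitions: one partition is grown to dense, trained together with the sparse remainder, then pruned back by magnitude before the next partition grows. The sketch below is a minimal Python/PyTorch illustration of that loop under stated assumptions; the partition layout and the `train_fn` / `magnitude_prune` helpers are hypothetical and are not the authors' released implementation.

```python
import torch
import torch.nn as nn


def magnitude_prune(weight: torch.Tensor, ratio: float) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude (1 - ratio) fraction of `weight`."""
    k = int(weight.numel() * ratio)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


def cgap_train(model: nn.Module, partitions, rounds: int, prune_ratio: float, train_fn):
    """Hypothetical C-GaP loop.

    `partitions` is a list of lists of parameter names. At each step the
    current partition is grown to dense, the model is trained for one GaP
    step (with `train_fn` expected to re-apply the masks after every
    update), and the partition is then pruned back to the target sparsity.
    """
    params = dict(model.named_parameters())
    masks = {name: torch.ones_like(p) for name, p in params.items()}
    for _ in range(rounds):
        for part in partitions:
            # Grow: make the current partition dense by clearing its masks.
            for name in part:
                masks[name].fill_(1.0)
            # Train for one GaP step while the other partitions stay sparse.
            train_fn(model, masks)
            # Prune the partition back to the target sparsity by magnitude.
            for name in part:
                masks[name] = magnitude_prune(params[name].data, prune_ratio)
                params[name].data.mul_(masks[name])
    return masks
```

P-GaP (Algorithm 2) reuses the same grow-and-prune primitives but schedules the partitions differently across steps.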
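As a quick consistency check of the schedules quoted in the Dataset Splits row, the step and epoch counts compose as follows (plain arithmetic using only the figures stated above).

```python
# GaP step counts (rounds x partitions), as quoted above.
cgap_steps = 7 * 4   # C-GaP: 28 steps
pgap_steps = 8 * 4   # P-GaP: 32 steps

# Prune-from-dense baseline: dense pretraining + ADMM regularization + fine-tuning.
prune_from_dense_epochs = 250 + 250 + 250   # 750 epochs total at 80% sparsity

assert cgap_steps == 28 and pgap_steps == 32
assert prune_from_dense_epochs == 750
```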
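The hyperparameters in the Experiment Setup row map onto a standard PyTorch optimizer and learning-rate schedule. The sketch below is an assumption of how they could be wired together (SGD with momentum 0.875 and weight decay 3.05e-5, a 2-epoch linear warm-up to a peak learning rate of 2.048, then cosine annealing); the exact warm-up and per-iteration stepping used with the NVIDIA reference scripts may differ.

```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted from the paper's image-classification setup.
BATCH_SIZE = 2048
BASE_LR = 2.048
MOMENTUM = 0.875
WEIGHT_DECAY = 3.05e-5
WARMUP_EPOCHS = 2
TOTAL_EPOCHS = 150  # final fine-tuning stage


def build_optimizer_and_scheduler(model: torch.nn.Module):
    """Build SGD plus an epoch-level linear-warm-up-then-cosine scheduler."""
    optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR,
                                momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)

    def lr_lambda(epoch: int) -> float:
        if epoch < WARMUP_EPOCHS:
            # Linear warm-up over the first 2 epochs.
            return (epoch + 1) / WARMUP_EPOCHS
        # Cosine annealing from the peak learning rate down to 0.
        progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Calling `scheduler.step()` once at the end of each training epoch reproduces the quoted warm-up-then-cosine shape.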