Advancing Model Pruning via Bi-level Optimization
Authors: Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tianlong Chen, Mingyi Hong, Yanzhi Wang, Sijia Liu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 data sets, we demonstrate that BIP can find better winning tickets than IMP in most cases, and is computationally as efficient as the one-shot pruning schemes, demonstrating 2-7× speedup over IMP for the same level of model accuracy and sparsity. |
| Researcher Affiliation | Collaboration | Michigan State University; IBM Research; Northeastern University; University of Texas at Austin; University of Minnesota, Twin Cities |
| Pseudocode | Yes | In Fig. A1, we highlight the algorithmic details on the BIP pipeline. We present more implementation details of BIP below and refer readers to Appendix B for a detailed algorithm description. |
| Open Source Code | Yes | Codes are available at https://github.com/OPTML-Group/BiP. |
| Open Datasets | Yes | Following the pruning benchmark in [22], we consider 4 datasets including CIFAR-10 [102], CIFAR-100 [102], Tiny-ImageNet [103], ImageNet [104], and 5 architecture types including ResNet-20/56/18/50 and VGG-16 [105, 106]. |
| Dataset Splits | Yes | The solid line and shaded area of each pruning method represent the mean and variance of test accuracies over 3 independent trials. |
| Hardware Specification | Yes | GPU Model(s): NVIDIA A6000 |
| Software Dependencies | Yes | software environment: Python (3.8.12), Pytorch (1.10.0), Torchvision (0.11.1), CUDA (11.3), CUDNN (8.2.1), Torch-scatter (2.0.9), Torch-sparse (0.6.12), Numpy (1.21.5) |
| Experiment Setup | Yes | Hyperparameter tuning: As described in (θ-step)-(m-step), BIP needs to set two learning rates α and β for lower-level and upper-level optimization, respectively. We choose α = 0.01 and β = 0.1 in all experiments, where we adopt the mask learning rate β from Hydra [9] and set a smaller lower-level learning rate α, as θ is initialized by a pre-trained dense model. We show an ablation study on α in Fig. A8(c). BLO also brings in the lower-level convexification parameter γ. We set γ = 1.0 in experiments and refer readers to Fig. A8(b) for a sanity check. |
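
To make the reported (θ-step)/(m-step) setup concrete, the sketch below illustrates a BIP-style alternating bi-level update with the hyperparameters quoted above (α = 0.01 for the lower-level weight update, β = 0.1 for the upper-level mask update, γ = 1.0 for lower-level convexification). This is a minimal, hypothetical PyTorch sketch, not the authors' implementation (see the linked repository for that): the toy model, random data, target sparsity, the quadratic form of the convexification term, and the straight-through mask relaxation are assumptions, and the implicit-gradient correction BIP uses in the upper-level update is omitted.

```python
# Hypothetical sketch of a BIP-style alternating bi-level update (NOT the authors' code).
# Upper level: mask scores, hard-thresholded to a binary mask at a target sparsity,
# updated with learning rate beta = 0.1.
# Lower level: dense weights theta, updated with learning rate alpha = 0.01 on the masked
# model plus a gamma-weighted quadratic term standing in for lower-level convexification.
import torch
import torch.nn.functional as F

alpha, beta, gamma = 0.01, 0.1, 1.0   # hyperparameters reported in the paper
sparsity = 0.9                        # illustrative target sparsity (assumption)

def hard_mask(scores, sparsity):
    """Binary mask keeping the top-(1 - sparsity) fraction of scores."""
    k = max(1, int((1.0 - sparsity) * scores.numel()))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).float()

# Toy single-layer "model" and random data, just to make the loop runnable.
theta = torch.randn(10, 32, requires_grad=True)    # dense weights of a linear layer
scores = torch.rand(10, 32, requires_grad=True)    # upper-level mask scores
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))

opt_theta = torch.optim.SGD([theta], lr=alpha, momentum=0.9)
opt_scores = torch.optim.SGD([scores], lr=beta, momentum=0.9)

for step in range(100):
    # (theta-step): lower-level update of the weights under the current, fixed mask,
    # with a quadratic regularizer weighted by gamma (exact BIP form not reproduced here).
    m = hard_mask(scores, sparsity)                # no gradient flows through the threshold
    loss_lower = F.cross_entropy(x @ (m * theta).t(), y) + 0.5 * gamma * theta.pow(2).mean()
    opt_theta.zero_grad(); loss_lower.backward(); opt_theta.step()

    # (m-step): upper-level update of the mask scores via a straight-through estimator;
    # the implicit-gradient correction used by BIP is omitted in this sketch.
    m_ste = (hard_mask(scores, sparsity) - scores).detach() + scores
    loss_upper = F.cross_entropy(x @ (m_ste * theta.detach()).t(), y)
    opt_scores.zero_grad(); loss_upper.backward(); opt_scores.step()
```

As in the quoted setup, θ would be initialized from a pre-trained dense model in the actual BIP pipeline (Fig. A1 / Appendix B of the paper), which is why the lower-level learning rate α is chosen smaller than the Hydra-style mask learning rate β.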