Advancing Model Pruning via Bi-level Optimization

Authors: Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tianlong Chen, Mingyi Hong, Yanzhi Wang, Sijia Liu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 data sets, we demonstrate that BIP can find better winning tickets than IMP in most cases, and is computationally as efficient as the one-shot pruning schemes, demonstrating 2-7× speedup over IMP for the same level of model accuracy and sparsity.
Researcher Affiliation | Collaboration | 1 Michigan State University, 2 IBM Research, 3 Northeastern University, 4 University of Texas at Austin, 5 University of Minnesota, Twin Cities
Pseudocode | Yes | In Fig. A1, we highlight the algorithmic details on the BIP pipeline. We present more implementation details of BIP below and refer readers to Appendix B for a detailed algorithm description.
Open Source Code | Yes | Codes are available at https://github.com/OPTML-Group/BiP.
Open Datasets | Yes | Following the pruning benchmark in [22], we consider 4 datasets including CIFAR-10 [102], CIFAR-100 [102], Tiny-ImageNet [103], ImageNet [104], and 5 architecture types including ResNet-20/56/18/50 and VGG-16 [105, 106]. (See the data-loading sketch below the table.)
Dataset Splits | Yes | The solid line and shaded area of each pruning method represent the mean and variance of test accuracies over 3 independent trials.
Hardware Specification | Yes | GPU Model(s): NVIDIA A6000
Software Dependencies | Yes | Software environment: Python (3.8.12), PyTorch (1.10.0), Torchvision (0.11.1), CUDA (11.3), CUDNN (8.2.1), Torch-scatter (2.0.9), Torch-sparse (0.6.12), Numpy (1.21.5)
Experiment Setup | Yes | Hyperparameter tuning: As described in (θ-step)-(m-step), BIP needs to set two learning rates α and β for lower-level and upper-level optimization, respectively. We choose α = 0.01 and β = 0.1 in all experiments, where we adopt the mask learning rate β from Hydra [9] and set a smaller lower-level learning rate α, as θ is initialized by a pre-trained dense model. We show an ablation study on α in Fig. A8(c). BLO also brings in the lower-level convexification parameter γ. We set γ = 1.0 in experiments and refer readers to Fig. A8(b) for a sanity check. (See the update-loop sketch below the table.)
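
The dataset and architecture pairs quoted in the Open Datasets row are standard benchmarks. Below is a minimal sketch of how one such pair (CIFAR-10 with ResNet-18) could be instantiated with torchvision; the exact transforms and the CIFAR-style ResNet definition live in the authors' repository and may differ from the torchvision defaults assumed here.

```python
# Minimal sketch: CIFAR-10 + ResNet-18 via torchvision (not the authors' exact setup).
import torch
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # standard CIFAR augmentation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

# ImageNet-style ResNet-18 with a 10-class head; the paper's CIFAR-specific variant differs.
model = models.resnet18(num_classes=10)
```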
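
The Experiment Setup row describes BIP's alternating (θ-step)/(m-step) optimization with learning rates α = 0.01 and β = 0.1 and lower-level convexification γ = 1.0. The sketch below illustrates one such alternating update on a toy linear layer, assuming a Hydra-style mask parameterization (continuous scores, hard top-k threshold, straight-through gradients) and omitting the closed-form implicit-gradient correction the paper derives for the m-step; it shows the structure of the update only and is not the authors' implementation.

```python
# Simplified alternating (theta-step)/(m-step) sketch with alpha=0.01, beta=0.1, gamma=1.0.
# The mask parameterization and the omission of the implicit-gradient term are assumptions.
import torch
import torch.nn.functional as F


def hard_topk_mask(scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the (1 - sparsity) fraction of largest scores."""
    k = int((1.0 - sparsity) * scores.numel())
    threshold = torch.topk(scores.flatten(), k).values.min()
    mask = (scores >= threshold).float()
    # Straight-through estimator: binary values forward, identity gradient backward.
    return mask + scores - scores.detach()


def bip_step(weight, scores, x, y, sparsity, alpha=0.01, beta=0.1, gamma=1.0):
    """One alternating update on a single linear layer (toy example)."""
    # (theta-step): SGD on the convexified lower-level objective
    #   loss(m ⊙ theta) + (gamma / 2) * ||theta||^2
    mask = hard_topk_mask(scores, sparsity).detach()
    theta = weight.detach().requires_grad_(True)
    lower = F.cross_entropy(x @ (mask * theta).t(), y) + 0.5 * gamma * theta.pow(2).sum()
    (g_theta,) = torch.autograd.grad(lower, theta)
    weight = weight.detach() - alpha * g_theta

    # (m-step): gradient step on the upper-level loss w.r.t. the mask scores.
    # The full BIP update adds an implicit-gradient correction scaled by 1/gamma;
    # only the direct gradient is kept here for brevity.
    scores = scores.detach().requires_grad_(True)
    mask = hard_topk_mask(scores, sparsity)
    upper = F.cross_entropy(x @ (mask * weight).t(), y)
    (g_scores,) = torch.autograd.grad(upper, scores)
    scores = scores.detach() - beta * g_scores
    return weight, scores


if __name__ == "__main__":
    torch.manual_seed(0)
    weight = torch.randn(10, 32)            # stands in for pre-trained dense weights
    scores = weight.abs().clone()           # magnitude-initialized mask scores
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    for _ in range(5):
        weight, scores = bip_step(weight, scores, x, y, sparsity=0.9)
```

The γ-term in the θ-step is the quadratic convexification of the lower-level objective referred to in the quote; setting γ = 0 in this sketch would reduce it to plain alternating SGD on the masked loss.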