Structural Pruning via Latency-Saliency Knapsack

Authors: Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, Jose M. Alvarez

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine HALP on both classification and detection tasks, over varying networks, on ImageNet and VOC datasets, and on different platforms. In particular, for ResNet-50/-101 pruning on ImageNet, HALP improves network throughput by 1.60×/1.90× with +0.3%/-0.2% top-1 accuracy changes, respectively. For SSD pruning on VOC, HALP improves throughput by 1.94× with only a 0.56 mAP drop. HALP consistently outperforms prior art, sometimes by large margins.
Researcher Affiliation | Industry | NVIDIA; {mshen,dannyy,pmolchanov,lmao,jiannal,josea}@nvidia.com
Pseudocode | Yes | A description of the pseudocode of the augmented knapsack solver is provided in Algo. 1 (a detailed explanation is provided in Appendix A). The augmented solver is required to make sure that the latency cost is correct. (An illustrative solver sketch follows the table.)
Open Source Code | Yes | Project page at https://halp-neurips.github.io/. Did you include the license to the code and datasets? [Yes] See Abstract. Did you include any new assets either in the supplemental material or as a URL? [Yes] Code will be released. The link to the code is provided in the abstract.
Open Datasets | Yes | We use ImageNet ILSVRC2012 [53] for classification. We use the popular architecture Single Shot Detector (SSD) [35] on the PASCAL VOC dataset [13]. The ImageNet [53] and PASCAL VOC [13] datasets are open source and available for non-commercial academic research.
Dataset Splits | Yes | We use the standard ImageNet-1K training set for training. For validation, we use the 50K-image validation set of ImageNet. For the VOC dataset, we use the union of VOC2007 trainval and VOC2012 trainval as the training set and VOC2007 test as the test set. (A split-construction sketch follows the table.)
Hardware Specification | Yes | The target hardware is an NVIDIA Titan V GPU. We apply HALP targeting latency reduction on multiple platforms to show the scalability of our method: NVIDIA TITAN V GPU, Jetson TX2, Jetson Xavier, and Intel CPU. The latency on Jetson TX2 and CPU is measured using PyTorch; on Xavier it is measured using TensorRT FP32. Result tables additionally report accuracy drop and speed on an RTX 3080 GPU. (A latency-timing sketch follows the table.)
Software Dependencies | Yes | The latency on Jetson TX2 and CPU is measured using PyTorch; on Xavier it is measured using TensorRT FP32. We run inference of the model with FP32, FP16, and INT8. For INT8, we quantize the model using entropy calibration with 2560 randomly selected ImageNet training images. We also export the models into ONNX format and test the inference speed with TensorRT (version 7.2.1.6). (An ONNX-export sketch follows the table.)
Experiment Setup | Yes | We perform one pruning step every r minibatches and repeat for k pruning steps in total. In particular, we set k milestones that gradually decrease the total latency toward the goal via an exponential scheduler [10], with C_1 > C_2 > ... > C_k and C_k = C. We use the standard ImageNet-1K training set for training. For validation, we use the 50K-image validation set of ImageNet. We follow the common practice of training the models on 8 GPUs with a batch size of 256. (A milestone-schedule sketch follows the table.)
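
The Pseudocode row above refers to the paper's augmented knapsack solver (Algorithm 1). As a rough illustration only, and not a reproduction of that algorithm, the sketch below solves the grouped knapsack that a latency-saliency formulation reduces to once latency is read from a per-layer lookup table: for each layer, choose how many of its highest-saliency neurons to keep so that total saliency is maximized within a global latency budget. The `latency_table` structure, the discretization step, and all names are assumptions, not the authors' implementation.

```python
# Illustrative multiple-choice knapsack over per-layer "keep top-j neurons"
# options. NOT the paper's Algorithm 1; latency_table is a hypothetical
# per-layer latency lookup (latency as a function of kept channel count).

def latency_knapsack(saliencies, latency_table, budget, step=0.05):
    """saliencies[l]    : layer l's neuron saliencies, sorted descending.
       latency_table[l] : latency_table[l][j] = layer l's latency when its
                          top-j neurons are kept (j = 0 .. len(saliencies[l])).
       budget, step     : latency budget and DP discretization (same unit)."""
    B = int(budget / step)
    NEG = float("-inf")
    dp = [0.0] + [NEG] * B          # dp[b]: best saliency at discretized cost b
    choices = []                    # per-layer backtracking tables

    for sal, lat in zip(saliencies, latency_table):
        prefix = [0.0]
        for s in sal:               # prefix[j] = saliency of keeping top-j
            prefix.append(prefix[-1] + s)
        new_dp = [NEG] * (B + 1)
        choice = [0] * (B + 1)
        for b in range(B + 1):
            for j in range(len(sal) + 1):
                cost = int(round(lat[j] / step))
                if cost <= b and dp[b - cost] != NEG:
                    val = dp[b - cost] + prefix[j]
                    if val > new_dp[b]:
                        new_dp[b], choice[b] = val, j
        dp = new_dp
        choices.append(choice)

    # Backtrack the kept-neuron count per layer from the best reachable budget.
    b = max(range(B + 1), key=lambda i: dp[i])
    best_val, kept = dp[b], []
    for l in range(len(saliencies) - 1, -1, -1):
        j = choices[l][b]
        kept.append(j)
        b -= int(round(latency_table[l][j] / step))
    return list(reversed(kept)), best_val
```

Restricting each layer's options to "keep the top-j neurons" is what makes a lookup-table latency cost well defined here; the paper's augmented solver serves the same purpose of keeping the latency cost consistent with the final per-layer selection.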
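
The Dataset Splits row can be made concrete with torchvision. The sketch below (data-root paths are assumptions; transforms and download handling are omitted) builds the ImageNet train/val folders and the VOC2007+VOC2012 trainval union with VOC2007 test.

```python
# Minimal sketch of the splits quoted above; "./data" paths are placeholders.
from torch.utils.data import ConcatDataset
from torchvision import datasets

imagenet_train = datasets.ImageFolder("./data/imagenet/train")
imagenet_val = datasets.ImageFolder("./data/imagenet/val")        # 50K images

# Training set: union of VOC2007 trainval and VOC2012 trainval.
voc_train = ConcatDataset([
    datasets.VOCDetection("./data/voc", year="2007", image_set="trainval"),
    datasets.VOCDetection("./data/voc", year="2012", image_set="trainval"),
])
# Test set: VOC2007 test.
voc_test = datasets.VOCDetection("./data/voc", year="2007", image_set="test")
```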
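
The Hardware Specification row lists the target platforms but not the measurement harness. The sketch below shows one common way to time a forward pass in PyTorch on a GPU, with warm-up iterations and explicit synchronization; the model, batch size, and iteration counts are assumptions, not the authors' protocol.

```python
# Illustrative GPU latency measurement in PyTorch (warm-up, then timed loop).
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval().cuda()
x = torch.randn(256, 3, 224, 224, device="cuda")   # batch 256; reduce if OOM

with torch.no_grad():
    for _ in range(20):                 # warm-up so clocks/cuDNN settle
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()            # wait for queued kernels to finish
    latency_ms = (time.perf_counter() - start) / 100 * 1000

print(f"mean forward latency: {latency_ms:.2f} ms")
```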
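
The Software Dependencies row mentions exporting models to ONNX and benchmarking with TensorRT 7.2.1.6. The sketch below illustrates only the export step with `torch.onnx.export`; the file name, opset version, and input shape are assumptions, and the INT8 entropy calibration used by the authors is not reproduced here.

```python
# Illustrative ONNX export; "resnet50.onnx" and opset 13 are placeholders.
import torch
import torchvision

model = torchvision.models.resnet50().eval()
dummy = torch.randn(1, 3, 224, 224)                 # example input shape
torch.onnx.export(model, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=13)
# The exported file can then be benchmarked with TensorRT's trtexec tool,
# e.g. `trtexec --onnx=resnet50.onnx --fp16` (INT8 runs additionally require
# a calibration step, as described in the row above).
```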
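
The Experiment Setup row describes k latency milestones C_1 > C_2 > ... > C_k = C produced by an exponential scheduler, with one pruning step applied every r minibatches. A minimal sketch of such a schedule is below; the function name and example numbers are assumptions, not the authors' code.

```python
# Geometric interpolation from the unpruned latency down to the target budget,
# so each of the k pruning steps removes a comparable latency fraction.
def latency_milestones(c_initial, c_target, k):
    ratio = (c_target / c_initial) ** (1.0 / k)
    return [c_initial * ratio ** i for i in range(1, k + 1)]

# Example: prune a 10 ms model down to a 4 ms target over k = 5 steps.
print(latency_milestones(10.0, 4.0, 5))
# -> roughly [8.33, 6.93, 5.77, 4.80, 4.00]
```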