Rethinking the Value of Network Pruning

Authors: Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, Trevor Darrell

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks... In this work, we show that both of the beliefs mentioned above are not necessarily true for structured pruning methods, which prune at the levels of convolution channels or larger. Based on an extensive empirical evaluation of state-of-the-art pruning algorithms on multiple datasets with multiple network architectures, we make two surprising observations.
Researcher Affiliation | Academia | Zhuang Liu (1), Mingjie Sun (2), Tinghui Zhou (1), Gao Huang (2), Trevor Darrell (1); (1) University of California, Berkeley; (2) Tsinghua University
Pseudocode | No | The paper does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | For reproducing the results and a more detailed knowledge about the settings, see our code at: https://github.com/Eric-mingjie/rethinking-network-pruning.
Open Datasets | Yes | In the network pruning literature, CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009) datasets are the de-facto benchmarks...
Dataset Splits | Yes | For CIFAR, training/fine-tuning takes 160/40 epochs. For ImageNet, training/fine-tuning takes 90/20 epochs. In our experiments, we use Scratch-E to denote training the small pruned models for the same epochs, and Scratch-B to denote training for the same amount of computation budget (on ImageNet, if the pruned model saves more than 2× FLOPs, we just double the number of epochs for training Scratch-B, which amounts to less computation budget than large model training). (An epoch-budget sketch follows the table.)
Hardware Specification | No | The paper mentions "less GPU memory" and "dedicated hardware/libraries" but does not specify the GPU models, CPU models, or other hardware configurations used to run the experiments.
Software Dependencies | No | This could be due to the difference in the deep learning frameworks: we used PyTorch (Paszke et al., 2017) while the original papers used Caffe (Jia et al., 2014). The paper mentions software by name but does not provide specific version numbers for reproducibility.
Experiment Setup | Yes | In our experiments, we use Scratch-E to denote training the small pruned models for the same epochs, and Scratch-B to denote training for the same amount of computation budget... We use standard training hyper-parameters and data-augmentation schemes... The optimization method is SGD with Nesterov momentum, using a stepwise decay learning rate schedule. For random weight initialization, we adopt the scheme proposed in (He et al., 2015)... For CIFAR, training/fine-tuning takes 160/40 epochs. For ImageNet, training/fine-tuning takes 90/20 epochs. (A training-setup sketch follows the table.)
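
The Dataset Splits row above quotes the Scratch-E / Scratch-B budget rule. Below is a minimal sketch, not the authors' code, of how those epoch budgets could be computed; the function name scratch_epochs is hypothetical, and the CIFAR scaling is an inference from "same amount of computation budget", since the quoted text only spells out the ImageNet doubling rule.

```python
# Minimal sketch (not the authors' code) of the Scratch-E / Scratch-B epoch
# budgets described above. The base epoch counts (160 for CIFAR, 90 for
# ImageNet) and the "double the epochs when savings exceed 2x FLOPs" rule are
# from the quoted text; the function name and the CIFAR scaling are assumptions.

def scratch_epochs(dataset: str, flops_saving: float) -> dict:
    """Return from-scratch training budgets for a pruned architecture.

    dataset      -- "cifar" or "imagenet"
    flops_saving -- FLOPs of the unpruned model / FLOPs of the pruned model
                    (e.g. 2.5 means the pruned model saves 2.5x FLOPs)
    """
    base = {"cifar": 160, "imagenet": 90}[dataset]   # standard training epochs
    scratch_e = base                                 # Scratch-E: same number of epochs
    if dataset == "imagenet":
        # Scratch-B on ImageNet: double the epochs only when the pruned model
        # saves more than 2x FLOPs (still cheaper than training the large model).
        scratch_b = 2 * base if flops_saving > 2.0 else base
    else:
        # On CIFAR, scale epochs with the FLOPs saving to roughly match the
        # large model's computation budget (assumption; not spelled out above).
        scratch_b = int(round(base * flops_saving))
    return {"Scratch-E": scratch_e, "Scratch-B": scratch_b}


print(scratch_epochs("imagenet", flops_saving=2.5))  # {'Scratch-E': 90, 'Scratch-B': 180}
print(scratch_epochs("cifar", flops_saving=1.5))     # {'Scratch-E': 160, 'Scratch-B': 240}
```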
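
The Experiment Setup row quotes the training recipe (SGD with Nesterov momentum, stepwise learning-rate decay, He initialization) without concrete hyper-parameter values. The following is a hedged PyTorch sketch of such a setup: the learning rate, momentum, weight decay, and decay milestones are common defaults rather than values reported in the paper, and he_init / build_training are hypothetical helper names.

```python
# Hedged PyTorch sketch of the quoted training recipe: He (2015) random
# initialization, SGD with Nesterov momentum, and a stepwise learning-rate
# decay. All numeric values below are common defaults, NOT values reported in
# the quoted text, and the helper names are hypothetical.

import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR


def he_init(module: nn.Module) -> None:
    """He (Kaiming) initialization for conv and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


def build_training(model: nn.Module, epochs: int = 160):
    """Optimizer + scheduler for training a (pruned) architecture from scratch."""
    model.apply(he_init)  # random re-initialization, as in Scratch-E / Scratch-B
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
                    weight_decay=1e-4, nesterov=True)
    # Stepwise decay: cut the learning rate by 10x at fixed points in training
    # (placeholder milestones; the quoted text does not give them).
    scheduler = MultiStepLR(optimizer,
                            milestones=[epochs // 2, (3 * epochs) // 4],
                            gamma=0.1)
    return optimizer, scheduler


# Example: a CIFAR Scratch-E run would use epochs=160, a Scratch-B run the
# scaled budget from scratch_epochs(); the standard loop then calls
# optimizer.step() per batch and scheduler.step() once per epoch.
```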