Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection

Authors: Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Practically, we improve prior arts of network pruning on learning compact neural architectures on ImageNet, including ResNet, MobileNetV2/V3, and ProxylessNet. Our theory and empirical results on MobileNet suggest that we should fine-tune the pruned subnetworks to leverage the information from the large model...
Researcher Affiliation | Collaboration | 1 Department of Computer Science, the University of Texas at Austin; 2 Department of Statistics, the University of Chicago; 3 Google Research.
Pseudocode | Yes | Algorithm 1: Layer-wise Greedy Subnetwork Selection (an illustrative sketch of the selection loop follows the table).
Open Source Code | Yes | Code is available at https://github.com/lushleaf/Network-Pruning-Greedy-Forward-Selection.
Open Datasets | Yes | We use ILSVRC2012, a subset of ImageNet (Deng et al., 2009), which consists of about 1.28 million training images and 50,000 validation images with 1,000 different classes.
Dataset Splits | Yes | We use ILSVRC2012, a subset of ImageNet (Deng et al., 2009), which consists of about 1.28 million training images and 50,000 validation images with 1,000 different classes.
Hardware Specification | No | The paper mentions training 'on 4 GPUs' but does not specify the make or model of the GPUs or any other hardware components.
Software Dependencies | No | The paper describes optimization algorithms and schedules (e.g., 'SGD optimizer with Nesterov momentum 0.9', 'cosine schedule') but does not specify software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | We use the standard SGD optimizer with Nesterov momentum 0.9 and weight decay 5 × 10^-5. For ResNet, we use a fixed learning rate of 2.5 × 10^-4. For the other architectures, following the original settings (Cai et al., 2019; Sandler et al., 2018), we decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2017) starting from 0.01. We fine-tune the subnetwork for 150 epochs with batch size 512 on 4 GPUs. We resize images to 224×224 resolution and adopt the standard data augmentation scheme (mirroring and shifting).
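
The Pseudocode row refers to the paper's Algorithm 1, which grows each layer's subnetwork greedily: starting from an empty set of neurons, it repeatedly adds the candidate whose inclusion most reduces the training loss. The Python sketch below is illustrative only; the function name greedy_select_layer and the evaluate_loss callback are assumptions made for exposition, not the authors' released implementation.

    def greedy_select_layer(model, layer_idx, candidate_neurons, target_size, evaluate_loss):
        """Layer-wise greedy forward selection: start from an empty subnetwork
        and, at each step, keep the candidate neuron whose inclusion yields the
        lowest training loss, until the layer reaches its target width."""
        selected = []
        while len(selected) < target_size:
            best_neuron, best_loss = None, float("inf")
            for neuron in candidate_neurons:
                if neuron in selected:
                    continue
                # evaluate_loss is a hypothetical helper that reports the training
                # loss when only `selected + [neuron]` is kept in layer `layer_idx`
                # of the pretrained network.
                loss = evaluate_loss(model, layer_idx, selected + [neuron])
                if loss < best_loss:
                    best_neuron, best_loss = neuron, loss
            selected.append(best_neuron)
        return selected

After selection, the paper fine-tunes the resulting subnetwork with the settings quoted in the Experiment Setup row.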
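
The Experiment Setup row translates into the following training configuration. Since the paper does not name its framework (see the Software Dependencies row), the PyTorch calls and the exact augmentation transforms below are assumptions; only the hyperparameters (Nesterov momentum 0.9, weight decay 5e-5, a fixed 2.5e-4 learning rate for ResNet, cosine decay from 0.01, 150 epochs, batch size 512, 224×224 inputs) come from the quoted text.

    from torch import optim
    from torchvision import transforms

    def make_finetune_optimizer(pruned_model, arch="mobilenet_v2", epochs=150):
        """SGD with Nesterov momentum 0.9 and weight decay 5e-5; ResNet keeps a
        fixed learning rate, other architectures decay 0.01 with a cosine schedule."""
        lr = 2.5e-4 if arch == "resnet" else 0.01
        optimizer = optim.SGD(pruned_model.parameters(), lr=lr,
                              momentum=0.9, nesterov=True, weight_decay=5e-5)
        scheduler = None
        if arch != "resnet":
            scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
        return optimizer, scheduler

    # 224x224 inputs with mirroring/shifting-style augmentation; the exact
    # transforms are an assumption, as the paper only calls them "standard".
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

Fine-tuning then runs for 150 epochs with batch size 512, split across 4 GPUs in the paper's setup.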