Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection
Authors: Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Practically, we improve prior arts of network pruning on learning compact neural architectures on ImageNet, including ResNet, MobileNetV2/V3, and ProxylessNet. Our theory and empirical results on MobileNet suggest that we should fine-tune the pruned subnetworks to leverage the information from the large model... |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, The University of Texas at Austin; ²Department of Statistics, The University of Chicago; ³Google Research. |
| Pseudocode | Yes | Algorithm 1 Layer-wise Greedy Subnetwork Selection (a hedged code sketch of the greedy loop follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/lushleaf/Network-Pruning-Greedy-Forward-Selection. |
| Open Datasets | Yes | We use ILSVRC2012, a subset of ImageNet (Deng et al., 2009) which consists of about 1.28 million training images and 50,000 validation images with 1,000 different classes. |
| Dataset Splits | Yes | We use ILSVRC2012, a subset of ImageNet (Deng et al., 2009) which consists of about 1.28 million training images and 50,000 validation images with 1,000 different classes. |
| Hardware Specification | No | The paper mentions training 'on 4 GPUs' but does not specify the make or model of the GPUs or any other hardware components. |
| Software Dependencies | No | The paper describes optimization algorithms and schedules (e.g., 'SGD optimizer with Nesterov momentum 0.9', 'cosine schedule') but does not specify software dependencies with version numbers (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | We use the standard SGD optimizer with Nesterov momentum 0.9 and weight decay 5 × 10^-5. For ResNet, we use a fixed learning rate 2.5 × 10^-4. For the other architectures, following the original settings (Cai et al., 2019; Sandler et al., 2018), we decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2017) starting from 0.01. We fine-tune the subnetwork for 150 epochs with batch size 512 on 4 GPUs. We resize images to 224×224 resolution and adopt the standard data augmentation scheme (mirroring and shifting). (A hedged configuration sketch follows the table.) |
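The Pseudocode row above refers to Algorithm 1 (layer-wise greedy subnetwork selection): starting from an empty layer, repeatedly add the neuron whose inclusion most reduces the loss until the target width is reached. Below is a minimal Python sketch of that greedy loop, assuming a caller-supplied `evaluate_loss` function and a boolean channel mask (both names are illustrative, not from the authors' released code), and omitting the paper's exact re-weighting of the selected neurons.

```python
# Hedged sketch of layer-wise greedy forward selection (cf. Algorithm 1).
# `evaluate_loss` and `mask` are assumed/illustrative names, not the authors' API.
import torch


def greedy_forward_selection(evaluate_loss, num_neurons, target_size):
    """Greedily grow a subnetwork for one layer.

    evaluate_loss: callable taking a boolean mask over this layer's neurons and
                   returning the loss of the model restricted to those neurons.
    num_neurons:   width of the layer in the large (pretrained) network.
    target_size:   number of neurons to keep in the pruned layer.
    """
    selected = []
    mask = torch.zeros(num_neurons, dtype=torch.bool)

    while len(selected) < target_size:
        best_idx, best_loss = None, float("inf")
        # Try each not-yet-selected neuron and keep the one that lowers the loss most.
        for i in range(num_neurons):
            if mask[i]:
                continue
            mask[i] = True
            loss = evaluate_loss(mask)
            mask[i] = False
            if loss < best_loss:
                best_idx, best_loss = i, loss
        mask[best_idx] = True
        selected.append(best_idx)

    return selected
```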
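The Experiment Setup row quotes SGD with Nesterov momentum 0.9, weight decay 5 × 10^-5, a fixed learning rate of 2.5 × 10^-4 for ResNet, and a cosine schedule starting from 0.01 for the other architectures. Since the paper does not name its software stack (see the Software Dependencies row), the following PyTorch/torchvision sketch of that configuration is an assumption, including the mapping of "mirroring and shifting" onto specific torchvision transforms.

```python
# Hedged PyTorch sketch of the fine-tuning configuration quoted in the table.
# The framework choice and the exact augmentation transforms are assumptions.
import torch
from torchvision import transforms


def make_optimizer_and_scheduler(model, arch="mobilenet_v2", epochs=150):
    if arch.startswith("resnet"):
        # ResNet: fixed learning rate 2.5e-4, no schedule.
        opt = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                              momentum=0.9, nesterov=True, weight_decay=5e-5)
        sched = None
    else:
        # Other architectures: cosine decay from an initial learning rate of 0.01.
        opt = torch.optim.SGD(model.parameters(), lr=0.01,
                              momentum=0.9, nesterov=True, weight_decay=5e-5)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched


# 224x224 inputs, with "mirroring and shifting" approximated here by a random
# resized crop plus horizontal flip (an assumption about the exact transforms).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```

The remaining quoted details (150 fine-tuning epochs, batch size 512 split across 4 GPUs) would sit in the training loop and data loader; the GPU models are not specified in the paper (see the Hardware Specification row).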