Learning Pruning-Friendly Networks via Frank-Wolfe: One-Shot, Any-Sparsity, And No Retraining

Authors: Miao Lu, Xiaolong Luo, Tianlong Chen, Wuyang Chen, Dong Liu, Zhangyang Wang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR-10 and Tiny-ImageNet datasets demonstrate that our new framework named SFW-pruning consistently achieves the state-of-the-art performance on various benchmark DNNs over a wide range of pruning ratios.
Researcher Affiliation | Academia | University of Science and Technology of China; University of Texas at Austin
Pseudocode | Yes | Algorithm 1: Stochastic Frank-Wolfe with Momentum for Deep Neural Network Training; Algorithm 2: Stochastic Frank-Wolfe Pruning Framework (SFW-Pruning); Algorithm 3: Stochastic Frank-Wolfe Initialization Scheme (SFW-Init). (An illustrative sketch of the Algorithm 1 update step is given below the table.)
Open Source Code | Yes | Code is available at https://github.com/VITA-Group/SFW-Once-for-All-Pruning.
Open Datasets | Yes | We conduct experiments via two popular architectures, ResNet-18 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2014), on two benchmark datasets, CIFAR-10 (Krizhevsky et al., 2009) and Tiny-ImageNet (Wu et al., 2017).
Dataset Splits | No | The paper mentions training and testing but does not explicitly describe a separate validation split (e.g., percentages or sample counts). Although it adjusts the learning rate based on 5-epoch and 10-epoch average losses, it does not specify a distinct validation set for this purpose.
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We summarize the key experiment setups, with hyperparameters of the implementation presented in Appendix A in detail. [...] Initial learning rate α_0 1.0; Training batch size 128; Test batch size 100; Radius τ 15; K-frac {K_l}_{l=1..L} 5%; Training epochs T 180; Momentum ρ 0.9 (Table 2); Learning rate κ 0.001; Training iterations T 390; Minimal scaling ϵ, ε 0.01 (Table 3). [...] We decrease the learning rate by a factor of 10 at epochs 61 and 121. Also, we dynamically change the learning rate (Pokutta et al., 2020): the learning rate is multiplied by 0.7 if the 5-epoch average loss is greater than the 10-epoch average loss, and is increased by a factor of 1.06 if the opposite holds.
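
The dynamic learning-rate rule quoted in the Experiment Setup row can be sketched as follows, assuming a plain epoch loop. The helper name and the synthetic epoch loss below are hypothetical; only the decay/growth factors (0.7 and 1.06), the staircase drops at epochs 61 and 121, and the 5- vs. 10-epoch averaging windows come from the reported setup.

```python
from collections import deque

def adjust_learning_rate(lr, last5, last10, decay=0.7, growth=1.06):
    """Hypothetical helper: rescale lr by comparing short- vs. long-window
    average training loss, as described in the experiment setup."""
    if len(last10) < 10:
        return lr  # not enough history yet; keep the current rate
    avg5 = sum(last5) / len(last5)
    avg10 = sum(last10) / len(last10)
    if avg5 > avg10:
        return lr * decay   # recent loss is worse than the longer average
    return lr * growth      # loss is still improving; allow a larger step

lr = 1.0                          # initial learning rate alpha_0
last5 = deque(maxlen=5)           # last 5 epoch-average losses
last10 = deque(maxlen=10)         # last 10 epoch-average losses
for epoch in range(180):          # 180 training epochs
    epoch_loss = 1.0 / (epoch + 1)   # placeholder for the epoch-average loss
    if epoch in (61, 121):           # staircase decay by a factor of 10
        lr /= 10.0
    last5.append(epoch_loss)
    last10.append(epoch_loss)
    lr = adjust_learning_rate(lr, last5, last10)
```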
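
To make Algorithm 1 (stochastic Frank-Wolfe with momentum) concrete, here is a minimal PyTorch-style sketch of a single update step, assuming a K-sparse polytope constraint set of radius τ. The function names, the momentum convention, and the per-tensor handling of the sparsity budget are illustrative assumptions, not the authors' released implementation; the default values mirror the reported hyperparameters (ρ = 0.9, K-frac = 5%, τ = 15).

```python
import torch

def ksparse_lmo(direction, k, tau):
    """Hypothetical linear-minimization oracle over a K-sparse polytope of
    radius tau: put -tau * sign(direction) on the k largest-magnitude entries."""
    flat = direction.flatten()
    v = torch.zeros_like(flat)
    idx = flat.abs().topk(k).indices
    v[idx] = -tau * torch.sign(flat[idx])
    return v.view_as(direction)

def sfw_momentum_step(w, grad, momentum_buf, lr, rho=0.9, k_frac=0.05, tau=15.0):
    """One stochastic Frank-Wolfe step with momentum (illustrative sketch):
    update the gradient estimate, query the LMO, then move toward the vertex."""
    momentum_buf.mul_(rho).add_(grad, alpha=1.0 - rho)  # m <- rho*m + (1-rho)*g
    k = max(1, int(k_frac * w.numel()))                 # per-tensor sparsity budget
    v = ksparse_lmo(momentum_buf, k, tau)               # Frank-Wolfe vertex
    w.add_(v - w, alpha=lr)                             # w <- w + lr * (v - w)
    return w

# Toy usage on a single weight tensor:
w = torch.randn(64, 64)
m = torch.zeros_like(w)
g = torch.randn_like(w)          # stands in for a stochastic gradient
sfw_momentum_step(w, g, m, lr=0.1)
```

Because each update is a convex combination of the previous iterate and a K-sparse vertex, the weights concentrate on few coordinates, which is the property the paper exploits to prune at any sparsity without retraining.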