Layer-adaptive Sparsity for the Magnitude-based Pruning

Authors: Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, Jinwoo Shin

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of LAMP under a diverse experimental setup, encompassing various convolutional neural network architectures (VGG-16, ResNet-18/34, DenseNet-121, EfficientNet-B0) and various image datasets (CIFAR-10/100, SVHN, Restricted ImageNet). In all considered setups, LAMP consistently outperforms the baseline layerwise sparsity selection schemes.
Researcher Affiliation | Academia | Jaeho Lee (KAIST EE), Sejun Park (KAIST AI), Sangwoo Mo (KAIST EE), Sungsoo Ahn (MBZUAI), Jinwoo Shin (KAIST AI & EE); {jaeho-lee,sejun.park,swmo,jinwoos}@kaist.ac.kr, peter.ahn@mbzuai.ac.ae
Pseudocode | Yes | The first three steps can be easily implemented in PyTorch as follows.

    import torch

    def lamp_score(weight):
        # squared norm of the whole layer: sum of all squared weights
        normalizer = weight.norm() ** 2
        # sort weight magnitudes in ascending order
        sorted_weight, sorted_idx = weight.abs().view(-1).sort(descending=False)
        # cumulative sum of squared magnitudes of all strictly smaller weights
        weight_square_cumsum_temp = (sorted_weight ** 2).cumsum(dim=0)
        weight_square_cumsum = torch.zeros(weight_square_cumsum_temp.shape)
        weight_square_cumsum[1:] = weight_square_cumsum_temp[:len(weight_square_cumsum_temp) - 1]
        # divide each magnitude by the root of the sum of squared magnitudes
        # of all weights at least as large as itself
        sorted_weight /= (normalizer - weight_square_cumsum).sqrt()
        # scatter the normalized scores back to the original positions and shape
        score = torch.zeros(weight_square_cumsum.shape)
        score[sorted_idx] = sorted_weight
        score = score.view(weight.shape)
        return score
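As a rough illustration of how such per-layer scores are then used, the sketch below computes scores layer by layer and keeps the globally top-scoring connections; the function name global_lamp_prune and the survival_rate argument are illustrative assumptions, not the authors' API, and the released repository should be consulted for the exact implementation.

    # Illustrative sketch (assumed, not the authors' code): prune all layers
    # jointly by keeping the connections with the highest LAMP scores.
    def global_lamp_prune(weights, survival_rate=0.3):
        scores = [lamp_score(w) for w in weights]             # uses lamp_score above
        flat = torch.cat([s.view(-1) for s in scores])
        k = max(1, int(survival_rate * flat.numel()))         # number of weights to keep
        threshold = flat.sort(descending=True).values[k - 1]  # single global threshold
        return [(s >= threshold).float() for s in scores]     # one binary mask per layer

Applying one global threshold to the normalized scores is what yields layer-adaptive sparsity: layers whose weights score lower overall end up more heavily pruned.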
Open Source Code Yes Code: https://github.com/jaeho-lee/layer-adaptive-sparsity
Open Datasets | Yes | Datasets. We consider the following datasets: CIFAR-10/100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and Restricted ImageNet (Tsipras et al., 2019).
Dataset Splits | No | The CIFAR-10/100 datasets are augmented with random crops with a padding of 4 and random horizontal flips. We normalize both training and test datasets with constants (0.4914, 0.4822, 0.4465), (0.237, 0.243, 0.261). The paper mentions "training" and "test" datasets but does not provide specific split percentages, counts, or explicit cross-validation details for reproducibility.
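A minimal torchvision sketch of this preprocessing, assuming standard RandomCrop/RandomHorizontalFlip transforms and the normalization constants quoted above; the authors' exact data loaders are only available in their repository.

    import torchvision.transforms as T
    from torchvision.datasets import CIFAR10

    # Assumed reconstruction of the described CIFAR-10/100 preprocessing.
    normalize = T.Normalize((0.4914, 0.4822, 0.4465), (0.237, 0.243, 0.261))
    train_transform = T.Compose([
        T.RandomCrop(32, padding=4),   # random crops with a padding of 4
        T.RandomHorizontalFlip(),      # random horizontal flips
        T.ToTensor(),
        normalize,
    ])
    test_transform = T.Compose([T.ToTensor(), normalize])

    train_set = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
    test_set = CIFAR10(root="./data", train=False, download=True, transform=test_transform)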
Hardware Specification | No | The paper mentions 'Sparse GPU kernels for deep learning' in reference to another work (Gale et al., 2020), but does not specify the hardware (e.g., GPU models, CPU types) used for its own experiments.
Software Dependencies | No | With the exception of the weight rewinding experiment, we use AdamW (Loshchilov & Hutter, 2019) with learning rate 0.0003; we use vanilla Adam with learning rate 0.0003 for the weight rewinding experiment, following the setup of Frankle & Carbin (2019). For other hyperparameters, we follow the PyTorch default setup: β = (0.9, 0.999), wd = 0.01, ε = 10^-8. The paper mentions software components like PyTorch and AdamW but does not provide specific version numbers for any of them.
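Assuming current torch.optim defaults, the described setup corresponds to roughly the following sketch; the model below is a stand-in, and the paper does not tie these settings to a specific PyTorch version.

    import torch

    model = torch.nn.Linear(10, 10)  # stand-in for VGG-16 / ResNet-34 / etc.

    # AdamW with lr = 0.0003 and PyTorch defaults for the remaining hyperparameters
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
    )

    # The weight rewinding experiment instead uses vanilla Adam with the same learning rate
    rewind_optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)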
Experiment Setup | Yes | A EXPERIMENTAL SETUPS. For any implementational details not given in this section, we refer to the code at: https://github.com/jaeho-lee/layer-adaptive-sparsity Optimizer. With the exception of the weight rewinding experiment, we use AdamW (Loshchilov & Hutter, 2019) with learning rate 0.0003; we use vanilla Adam with learning rate 0.0003 for the weight rewinding experiment, following the setup of Frankle & Carbin (2019). For other hyperparameters, we follow the PyTorch default setup: β = (0.9, 0.999), wd = 0.01, ε = 10^-8. Pre-processing. The CIFAR-10/100 datasets are augmented with random crops with a padding of 4 and random horizontal flips. We normalize both training and test datasets with constants (0.4914, 0.4822, 0.4465), (0.237, 0.243, 0.261). ...

Table 1: Optimization details.

    Dataset               Model                       Initial training iter.   Re-training iter.   Batch size
    SVHN                  VGG-16                      40000                    30000               100
    CIFAR-10              {VGG-16, EfficientNet-B0}   50000                    40000               100
    CIFAR-10              DenseNet-121                80000                    60000               100
    CIFAR-100             VGG-16                      60000                    50000               100
    Restricted ImageNet   ResNet-34                   80000                    80000               128
    CIFAR-10              Conv-6 (SNIP)               50000                    40000               128
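For convenience, Table 1 can be transcribed into a small lookup structure like the one below; the dictionary name and field names are illustrative only and do not come from the authors' code.

    # Assumed transcription of Table 1 (illustrative names, not the authors' code).
    OPTIMIZATION_DETAILS = {
        ("SVHN", "VGG-16"):                   {"initial_iters": 40000, "retrain_iters": 30000, "batch_size": 100},
        ("CIFAR-10", "VGG-16"):               {"initial_iters": 50000, "retrain_iters": 40000, "batch_size": 100},
        ("CIFAR-10", "EfficientNet-B0"):      {"initial_iters": 50000, "retrain_iters": 40000, "batch_size": 100},
        ("CIFAR-10", "DenseNet-121"):         {"initial_iters": 80000, "retrain_iters": 60000, "batch_size": 100},
        ("CIFAR-100", "VGG-16"):              {"initial_iters": 60000, "retrain_iters": 50000, "batch_size": 100},
        ("Restricted ImageNet", "ResNet-34"): {"initial_iters": 80000, "retrain_iters": 80000, "batch_size": 128},
        ("CIFAR-10", "Conv-6 (SNIP)"):        {"initial_iters": 50000, "retrain_iters": 40000, "batch_size": 128},
    }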