Fast as CHITA: Neural Network Pruning with Combinatorial Optimization

Authors: Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a standard benchmark of pretrained models and datasets, CHITA leads to superior sparsity-accuracy tradeoffs than competing methods. For example, for MLPNet with only 2% of the weights retained, our approach improves the accuracy by 63% relative to the state of the art. Furthermore, when used in conjunction with finetuning SGD steps, our method achieves significant accuracy gains over state-of-the-art approaches. Our code is publicly available at: https://github.com/mazumder-lab/CHITA. ... 4. Experimental Results We compare our proposed framework with existing approaches, for both one-shot and gradual pruning.
Researcher Affiliation | Collaboration | 1 MIT, 2 Google Research. Correspondence to: Riade Benbaki <rbenbaki@mit.edu>, Wenyu Chen <wenyu@mit.edu>, Xiang Meng <mengx@mit.edu>, Hussein Hazimeh <hazimeh@google.com>, Natalia Ponomareva <nponomareva@google.com>, Zhe Zhao <zhezhao@google.com>, Rahul Mazumder <rahulmaz@mit.edu>.
Pseudocode | Yes | Algorithm 1 CHITA++: a multi-stage pruning procedure. Algorithm 2 A novel scheme to determine the stepsize τ_s. Algorithm 3 IHT with CD: IHT-CD(w0, k, t_HT, t_CD). Algorithm 4 Active set with IHT: CHITA-CD(w0, k, t_HT, t_CD, A0). Algorithm 5 Backsolve: CHITA-BSO(w0, k, t_HT). Algorithm 6 CHITA-BSO with block approximation. (An illustrative IHT sketch is given after the table.)
Open Source Code | Yes | Our code is publicly available at: https://github.com/mazumder-lab/CHITA.
Open Datasets | Yes | We use the same experimental setup as in recent work (Yu et al., 2022; Singh & Alistarh, 2020). The existing pruning methods we consider include MP (Magnitude Pruning, Mozer & Smolensky, 1989), WF (WoodFisher, Singh & Alistarh, 2020), CBS (Combinatorial Brain Surgeon, Yu et al., 2022) and M-FAC (Matrix-Free Approximate Curvature, Frantar et al., 2021). The pre-trained networks we use are MLPNet (30K parameters) trained on MNIST (LeCun et al., 1998), ResNet20 (He et al., 2016, 200K parameters) trained on CIFAR10 (Krizhevsky et al., 2009), and MobileNet (4.2M parameters) and ResNet50 (He et al., 2016, 22M parameters) trained on ImageNet (Deng et al., 2009).
Dataset Splits | No | The paper uses well-known public datasets (MNIST, CIFAR10, ImageNet) and states it uses the 'same experimental setup as in recent work (Yu et al., 2022; Singh & Alistarh, 2020)'. However, it does not explicitly provide the specific training, validation, or test split percentages or sample counts within the text of this paper.
Hardware Specification | Yes | All experiments were carried out on a computing cluster. Experiments for MLPNet and ResNet20 were run on an Intel Xeon Platinum 8260 machine with 20 CPUs; experiments for MobileNetV1 and ResNet50 were run on an Intel Xeon Gold 6248 machine with 40 CPUs and one GPU. ... Experiments for MobileNetV1 were run on an Intel Xeon Platinum 6248 machine with 30 CPUs and 2 GPUs; experiments for ResNet50 were run on five Intel Xeon Platinum 6248 machines with 200 CPUs and 10 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For each network and each sparsity level, we run our proposed methods CHITA (single-stage) and CHITA++ (multi-stage) with ridge value λ ranging from [10^-5, 10^3] and the number of IHT iterations (if Algorithm 4 is applied) ranging from [5, 500]. ... We display in Table 4 the Fisher sample size n and the mini-batch size m (also called Fisher batch size) used for gradient evaluation. ... We incorporate SGD with a momentum of 0.9 for 12 epochs between pruning steps. ... We utilize distributed training and set the batch size to 256 per GPU during the SGD training process. ... We implement a cosine-based learning rate schedule similar to the one used in the STR method (Kusupati et al., 2020). Specifically, the learning rate for each epoch e between two pruning steps that occur at epochs e1 and e2 is defined as: 0.00001 + 0.5 * (0.1 - 0.00001) * [1 + cos(π(e - e1)/(e2 - e1))]. (A small sketch of this schedule is given after the table.)
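
The Pseudocode row lists the algorithm names without further detail. For orientation, the sketch below is a minimal, generic iterative hard thresholding (IHT) loop of the kind underlying Algorithms 3-5; the gradient callable, the fixed step size tau, and the iteration count are illustrative placeholders, not the paper's CHITA-CD routine (which adds coordinate descent, an active-set strategy, and the adaptive stepsize of Algorithm 2).

    import numpy as np

    def hard_threshold(w, k):
        # Keep the k largest-magnitude entries of w; zero out the rest.
        out = np.zeros_like(w)
        idx = np.argpartition(np.abs(w), -k)[-k:]
        out[idx] = w[idx]
        return out

    def iht(grad_fn, w0, k, tau, t_iht):
        # Generic IHT: gradient step followed by projection onto k-sparse vectors.
        w = hard_threshold(w0, k)
        for _ in range(t_iht):
            w = hard_threshold(w - tau * grad_fn(w), k)
        return w

For example, for a least-squares objective 0.5 * ||A w - b||^2, grad_fn would be lambda w: A.T @ (A @ w - b), and any step size below 1 / ||A||_2^2 keeps the gradient step stable.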
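The quoted learning-rate formula in the Experiment Setup row can be read as the small helper below; the function name cosine_lr and the lr_min/lr_max keyword defaults are illustrative, with the constants 0.00001 and 0.1 taken directly from the quoted schedule.

    import math

    def cosine_lr(e, e1, e2, lr_min=0.00001, lr_max=0.1):
        # Learning rate at epoch e between two pruning steps at epochs e1 and e2,
        # per the cosine schedule quoted above (similar to STR, Kusupati et al., 2020).
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * (e - e1) / (e2 - e1)))

    # With pruning steps 12 epochs apart, the rate decays from 0.1 at e1 to 0.00001 at e2.
    print([round(cosine_lr(e, 0, 12), 5) for e in range(0, 13, 3)])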