Fast as CHITA: Neural Network Pruning with Combinatorial Optimization
Authors: Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a standard benchmark of pretrained models and datasets, CHITA leads to superior sparsity-accuracy tradeoffs compared with competing methods. For example, for MLPNet with only 2% of the weights retained, our approach improves the accuracy by 63% relative to the state of the art. Furthermore, when used in conjunction with finetuning SGD steps, our method achieves significant accuracy gains over state-of-the-art approaches. Our code is publicly available at: https://github.com/mazumder-lab/CHITA. ... 4. Experimental Results: We compare our proposed framework with existing approaches, for both one-shot and gradual pruning. |
| Researcher Affiliation | Collaboration | 1 MIT, 2 Google Research. Correspondence to: Riade Benbaki <rbenbaki@mit.edu>, Wenyu Chen <wenyu@mit.edu>, Xiang Meng <mengx@mit.edu>, Hussein Hazimeh <hazimeh@google.com>, Natalia Ponomareva <nponomareva@google.com>, Zhe Zhao <zhezhao@google.com>, Rahul Mazumder <rahulmaz@mit.edu>. |
| Pseudocode | Yes | Algorithm 1 CHITA++: a multi-stage pruning procedure. Algorithm 2 A novel scheme to determine the stepsize τ_s. Algorithm 3 IHT with CD: IHT-CD(w0, k, t_HT, t_CD). Algorithm 4 Active set with IHT: CHITA-CD(w0, k, t_HT, t_CD, A0). Algorithm 5 Backsolve: CHITA-BSO(w0, k, t_HT). Algorithm 6 CHITA-BSO with block approximation. (A minimal IHT sketch appears after the table.) |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/mazumder-lab/CHITA. |
| Open Datasets | Yes | We use the same experimental setup as in recent work (Yu et al., 2022; Singh & Alistarh, 2020). The existing pruning methods we consider include MP (Magnitude Pruning, Mozer & Smolensky, 1989), WF (WoodFisher, Singh & Alistarh, 2020), CBS (Combinatorial Brain Surgeon, Yu et al., 2022) and M-FAC (Matrix-Free Approximate Curvature, Frantar et al., 2021). The pre-trained networks we use are MLPNet (30K parameters) trained on MNIST (LeCun et al., 1998), ResNet20 (He et al., 2016, 200k parameters) trained on CIFAR10 (Krizhevsky et al., 2009), and MobileNet (4.2M parameters) and ResNet50 (He et al., 2016, 22M parameters) trained on ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper uses well-known public datasets (MNIST, CIFAR10, ImageNet) and states it uses the 'same experimental setup as in recent work (Yu et al., 2022; Singh & Alistarh, 2020)'. However, it does not explicitly provide the specific training, validation, or test split percentages or sample counts within the text of this paper. |
| Hardware Specification | Yes | All experiments were carried out on a computing cluster. Experiments for MLPNet and ResNet20 were run on an Intel Xeon Platinum 8260 machine with 20 CPUs; experiments for MobileNetV1 and ResNet50 were run on an Intel Xeon Gold 6248 machine with 40 CPUs and one GPU. ... Experiments for MobileNetV1 were run on an Intel Xeon Platinum 6248 machine with 30 CPUs and 2 GPUs; experiments for ResNet50 were run on five Intel Xeon Platinum 6248 machines with 200 CPUs and 10 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For each network and each sparsity level, we run our proposed methods CHITA (single-stage) and CHITA++ (multi-stage) with the ridge value λ ranging over [10^-5, 10^3] and the number of IHT iterations (if Algorithm 4 is applied) ranging over [5, 500]. ... We display in Table 4 the Fisher sample size n and the mini-batch size m (also called the Fisher batch size) used for gradient evaluation. ... We incorporate SGD with a momentum of 0.9 for 12 epochs between pruning steps. ... We utilize distributed training and set the batch size to 256 per GPU during the SGD training process. ... We implement a cosine-based learning rate schedule similar to the one used in the STR method (Kusupati et al., 2020). Specifically, the learning rate for each epoch e between two pruning steps that occur at epochs e1 and e2 is defined as: 0.00001 + 0.5 (0.1 - 0.00001) [1 + cos(π (e - e1) / (e2 - e1))]. (See the schedule sketch after the table.) |
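
The pseudocode row above centers on iterative hard thresholding (IHT). For reference, here is a minimal sketch of the basic IHT projection step, a gradient step followed by keeping the top-k magnitude weights, applied to a toy least-squares objective. It omits the paper's coordinate descent, active-set, and stepsize-selection refinements, and every name below (`hard_threshold`, `iht_step`, the toy matrix `A` standing in for the per-sample gradient matrix of the local quadratic model) is illustrative rather than taken from the CHITA codebase.

```python
import numpy as np

def hard_threshold(w: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    if k <= 0:
        return out
    idx = np.argpartition(np.abs(w), -k)[-k:]
    out[idx] = w[idx]
    return out

def iht_step(w: np.ndarray, grad_fn, tau: float, k: int) -> np.ndarray:
    """One IHT step on a smooth loss: gradient descent, then projection onto the k-sparse set."""
    return hard_threshold(w - tau * grad_fn(w), k)

# Toy usage: prune a random weight vector to 2% density on a quadratic loss.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
A = rng.normal(size=(50, 1000)) / np.sqrt(50)   # stand-in for per-sample gradients
b = A @ w                                       # outputs of the dense "model"
grad_fn = lambda v: A.T @ (A @ v - b)           # gradient of 0.5 * ||A v - b||^2
tau = 1.0 / np.linalg.norm(A, ord=2) ** 2       # safe stepsize: 1 / Lipschitz constant
k = int(0.02 * w.size)                          # retain 2% of the weights
for _ in range(100):
    w = iht_step(w, grad_fn, tau=tau, k=k)
```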
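The cosine learning-rate schedule quoted in the experiment-setup row can be written out directly. The helper below (`cosine_lr` is a hypothetical name, not from the paper's code) evaluates lr(e) = 0.00001 + 0.5 (0.1 - 0.00001) [1 + cos(π (e - e1) / (e2 - e1))] for epochs between two pruning steps at e1 and e2.

```python
import math

def cosine_lr(e: int, e1: int, e2: int,
              lr_min: float = 1e-5, lr_max: float = 0.1) -> float:
    """Cosine schedule between pruning steps at epochs e1 and e2:
    starts at lr_max right after a pruning step and decays to lr_min before the next one."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * (e - e1) / (e2 - e1)))

# Example: 12 epochs of SGD fine-tuning between pruning steps at epochs 0 and 12.
print([round(cosine_lr(e, 0, 12), 4) for e in range(13)])
```

At e = e1 the cosine term is 1, so the rate equals lr_max (0.1); at e = e2 it is -1, so the rate decays to lr_min (0.00001).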