CrAM: A Compression-Aware Minimizer

Authors: Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: specifically, we can prune models in one-shot to 70-80% sparsity with almost no accuracy loss, and to 90% with reasonable (~1%) accuracy loss, which is competitive with gradual compression methods."
Researcher Affiliation | Collaboration | "Alexandra Peste (1), Adrian Vladu (2), Eldar Kurtic (1), Christoph H. Lampert (1), Dan Alistarh (1, 3); (1) Institute of Science and Technology Austria (ISTA), (2) CNRS & IRIF, (3) Neural Magic, Inc."
Pseudocode | Yes | "Algorithm 1: Compression-Aware Minimization (CrAM / CrAM+)" (a hedged sketch of such an update appears after the table)
Open Source Code | Yes | "The code for reproducing the results is available at: https://github.com/IST-DASLab/CrAM."
Open Datasets | Yes | "Our experimental validation mainly focuses on sparsity, obtained by applying the Top-K operator, in the context of CrAM (i.e. TopK-CrAM). The main method we propose is CrAM+ with multiple sparsity levels chosen uniformly at random at each step (CrAM+-Multi). We also experiment with particular cases of this method, where only one sparsity level is used (e.g. CrAM+-k70), and also with the initial CrAM method with low sparsity (e.g. CrAM-k50). For image classification experiments, all one-shot pruning results are presented after BNT on a subset of 1000 training samples, i.e. 100 inference steps on batches of size 128, using standard random augmentations. 4.1 IMAGENET EXPERIMENTS. General Setup. We use a standard setup for training our ImageNet/ResNet50 models, similar to Foret et al. (2021), which we describe in Appendix B. To match the number of backpropagation steps of CrAM, we additionally train the dense baseline for twice as many epochs. We have found that ρ = 0.05 recommended by the authors of SAM (Foret et al., 2021) is a good value for CrAM, and we have kept it for all our ImageNet experiments. As stated, after one-shot pruning, we perform BNT on a subset of 1000 training samples (e.g. one per class), with standard augmentations. We show in Appendix C.3 that the accuracy after BNT is extremely stable w.r.t. the choice of calibration set." (Top-K and BNT sketches appear after the table)
Dataset Splits | Yes | "To determine the value of the hyperparameter ρ, we performed a grid search over values in the range 0.01-0.2, using a 90%-10% train-validation split, and found 0.1 and 0.15 to be the best values for SAM and CrAM+-Multi, respectively (achieving the highest validation accuracy)." (a grid-search sketch appears after the table)
Hardware Specification | Yes | "Table 20: (SQuADv1.1/BERT-base) Speed-ups of pruned BERT-base models relative to the dense model, benchmarked with the sparsity-aware inference engine DeepSparse (version 1.0.2) (Kurtz et al., 2020; Neural Magic, 2021) in two different scenarios on AMD EPYC 7702 64-Core Processor."
Software Dependencies | Yes | "Table 20: (SQuADv1.1/BERT-base) Speed-ups of pruned BERT-base models relative to the dense model, benchmarked with the sparsity-aware inference engine DeepSparse (version 1.0.2) (Kurtz et al., 2020; Neural Magic, 2021) in two different scenarios on AMD EPYC 7702 64-Core Processor."
Experiment Setup | Yes | "Hyperparameters for ImageNet experiments: For our ImageNet experiments, we use standard data augmentation, and we train the models using SGD for 100 epochs, with batch size 512, momentum 0.9, and weight decay 0.0001. The learning rate is linearly increased for the first 5 epochs until it reaches a maximum value of 0.2, after which it is decreased at each epoch, using a cosine scheduler." (a sketch of this recipe appears after the table)
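
The Pseudocode row points to Algorithm 1 (CrAM / CrAM+), and the Open Datasets row notes that compression is instantiated as the Top-K magnitude operator. Algorithm 1 itself is not quoted in the table, so the PyTorch sketch below only illustrates a plausible SAM-like, compression-aware step consistent with those excerpts; the helper names topk_mask and cram_plus_step, the per-tensor Top-K, the normalized ascent step, and the exact way the dense and compressed-point gradients are combined are assumptions, not the authors' implementation (see the repository linked above for the real code).

```python
# Hypothetical sketch of a CrAM+-style update with a Top-K compression operator.
# Names (topk_mask, cram_plus_step) and the combination of terms are assumptions,
# not the authors' implementation of Algorithm 1.
import torch


def topk_mask(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the (1 - sparsity) fraction of largest-magnitude entries (per tensor)."""
    n = weights.numel()
    k = max(1, int(n * (1.0 - sparsity)))
    # Threshold = k-th largest magnitude = (n - k + 1)-th smallest magnitude.
    threshold = weights.abs().flatten().kthvalue(n - k + 1).values
    return (weights.abs() >= threshold).float()


def cram_plus_step(model, compute_loss, rho=0.05, sparsity=0.7):
    """One compression-aware gradient computation; the optimizer step happens outside."""
    # 1) Dense gradient; also kept as the assumed extra dense term of CrAM+.
    #    (Sketch assumes every parameter receives a gradient.)
    model.zero_grad()
    compute_loss(model).backward()
    dense_grads = [p.grad.detach().clone() for p in model.parameters()]
    backups = [p.detach().clone() for p in model.parameters()]

    # 2) Normalized ascent-style perturbation of radius rho, then Top-K compression.
    #    Real implementations typically prune only weight tensors, not biases/BN,
    #    and often use a global rather than per-tensor magnitude threshold.
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in dense_grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), dense_grads):
            p.add_(rho * g / (grad_norm + 1e-12))  # perturb toward higher loss
            p.mul_(topk_mask(p, sparsity))         # compress the perturbed weights

    # 3) Gradient of the loss at the compressed, perturbed point.
    model.zero_grad()
    compute_loss(model).backward()

    # 4) Restore dense weights; leave the combined gradient in p.grad.
    with torch.no_grad():
        for p, w, g_dense in zip(model.parameters(), backups, dense_grads):
            p.copy_(w)
            p.grad.add_(g_dense)  # assumed dense term distinguishing CrAM+ from CrAM
```

In the CrAM+-Multi variant described above, the sparsity level would additionally be drawn uniformly at random from a small set at every step; that detail is omitted from the sketch.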
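
The Open Datasets row also states that one-shot pruning results are reported after batch norm tuning (BNT) on roughly 1000 training samples (100 forward passes on batches of size 128). The excerpt does not give the BNT implementation; the sketch below shows one common way to re-estimate BatchNorm running statistics on a calibration loader, with recalibrate_batchnorm being an illustrative name.

```python
# Hypothetical batch-norm tuning (BNT) sketch: re-estimate BatchNorm running
# statistics on a small calibration set after one-shot pruning.
import torch
import torch.nn as nn


@torch.no_grad()
def recalibrate_batchnorm(model: nn.Module, calibration_loader, num_batches: int = 100):
    # Reset running mean/var so they are re-estimated from scratch.
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            module.momentum = None  # cumulative moving average over the calibration set

    model.train()  # BN layers only update running stats in train mode
    for step, (images, _labels) in enumerate(calibration_loader):
        if step >= num_batches:  # e.g. 100 batches of size 128, as in the excerpt
            break
        model(images)  # forward pass only; no backward, no optimizer step
    model.eval()
```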
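
The Dataset Splits row describes a grid search for ρ over the range 0.01-0.2 on a 90%/10% train-validation split. A minimal sketch of such a sweep follows; the candidate grid values and the train_fn/eval_fn hooks are placeholders, since the excerpt does not specify them.

```python
# Hypothetical sketch of the reported rho grid search with a 90%/10%
# train-validation split; train_fn and eval_fn are caller-supplied hooks,
# and the candidate grid values are examples only.
import torch
from torch.utils.data import random_split


def select_rho(full_train_set, train_fn, eval_fn,
               candidate_rhos=(0.01, 0.05, 0.1, 0.15, 0.2)):
    n_train = int(0.9 * len(full_train_set))
    train_set, val_set = random_split(
        full_train_set, [n_train, len(full_train_set) - n_train],
        generator=torch.Generator().manual_seed(0),  # same split for every candidate
    )
    best_rho, best_acc = None, float("-inf")
    for rho in candidate_rhos:
        model = train_fn(train_set, rho)  # train with this perturbation radius
        acc = eval_fn(model, val_set)     # validation accuracy
        if acc > best_acc:
            best_rho, best_acc = rho, acc
    return best_rho
```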
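
Finally, the Experiment Setup row fixes the ImageNet training recipe. The following sketch wires those reported hyperparameters into standard PyTorch schedulers; the warmup starting value, the per-epoch scheduler stepping, and the placeholder model are assumptions not stated in the excerpt.

```python
# Sketch of the reported ImageNet recipe: SGD with momentum 0.9 and weight decay
# 1e-4, batch size 512, 5 epochs of linear warmup to a peak LR of 0.2, then
# cosine decay, 100 epochs total. Schedulers are stepped once per epoch.
import torch

model = torch.nn.Linear(8, 8)  # placeholder; the paper trains a ResNet-50
epochs, warmup_epochs, peak_lr = 100, 5, 0.2

optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr,
                            momentum=0.9, weight_decay=1e-4)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(epochs):
    # ... one training epoch over batches of size 512 with standard augmentation ...
    scheduler.step()
```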