Adaptive Sharpness-Aware Pruning for Robust Sparse Networks
Authors: Anna Bair, Hongxu Yin, Maying Shen, Pavlo Molchanov, Jose M. Alvarez
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AdaSAP improves the robust accuracy of pruned models on image classification by up to +6% on ImageNet-C and +4% on ImageNet-V2, and on object detection by +4% on a corrupted Pascal VOC dataset, over a wide range of compression ratios, pruning criteria, and network architectures, outperforming recent pruning art by large margins. |
| Researcher Affiliation | Collaboration | Anna Bair (Carnegie Mellon University, abair@cmu.edu); Hongxu Yin, Maying Shen, Pavlo Molchanov, Jose Alvarez (NVIDIA, {dannyy, mshen, pmolchanov, josea}@nvidia.com) |
| Pseudocode | Yes | Algorithm 1: AdaSAP Optimization Iteration; Algorithm 2: AdaSAP Pruning Procedure. (A hedged sharpness-aware update sketch follows the table.) |
| Open Source Code | No | No, the paper does not provide an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | For image classification, we train on ImageNet-1K (Deng et al., 2009) and additionally evaluate on ImageNet-C (Hendrycks & Dietterich, 2019) and ImageNet-V2 (Recht et al., 2019). For object detection, we use the Pascal VOC dataset (Everingham et al., 2009). |
| Dataset Splits | Yes | For image classification, we report the Top-1 accuracy on each dataset and two robustness ratios, defined as the ratio of robust accuracy to validation accuracy: R_C = acc_C / acc_val and R_V2 = acc_V2 / acc_val. (A small helper after the table makes this explicit.) |
| Hardware Specification | Yes | We perform Distributed Data Parallel training across 8 V100 GPUs with batch size 128 for all experiments. |
| Software Dependencies | No | No, the paper describes the optimizer (SGD with cosine annealing learning rate, momentum, weight decay) and certain hyperparameter values, but does not list specific software dependencies with version numbers (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | The base optimizer is SGD with a cosine annealing learning rate schedule, a linear warmup over 8 epochs, a peak learning rate of 1.024, momentum of 0.875, and weight decay of 3.05e-05. Unless otherwise stated we use ρ_min = 0.01 and ρ_max = 2.0 for all experiments... We run the warm-up for 10 epochs, and then we follow the same pruning schedule... We fine-tune the pruned model for another 79 epochs (to reach 90 epochs total). (An optimizer/scheduler sketch follows the table.) |
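
For context on the Pseudocode row, below is a minimal sketch of a SAM-style sharpness-aware update in which each parameter gets its own perturbation radius clipped to [ρ_min, ρ_max]. This is not the paper's Algorithm 1: the `rho_by_param` mapping from importance scores to radii, the per-parameter gradient normalization (SAM proper normalizes by the global gradient norm), and all names here are assumptions made only to illustrate the two-step update.

```python
import torch


def adaptive_sam_step(model, loss_fn, inputs, targets, base_optimizer,
                      rho_by_param, rho_min=0.01, rho_max=2.0):
    """One SAM-style update where each parameter has its own perturbation radius.

    `rho_by_param` maps a parameter tensor to a radius in [rho_min, rho_max];
    deriving those radii from neuron importance is what the paper's Algorithm 1
    specifies and is not reproduced here.
    """
    # 1) Backward pass at the current weights to get the ascent direction.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # 2) Climb to a nearby point of higher loss: perturb each parameter along
    #    its normalized gradient, scaled by that parameter's own rho.
    perturbations = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            rho = min(max(rho_by_param.get(p, rho_min), rho_min), rho_max)
            e = rho * p.grad / (p.grad.norm() + 1e-12)
            p.add_(e)
            perturbations[p] = e
    model.zero_grad()

    # 3) Backward pass at the perturbed weights, undo the perturbation, and
    #    apply the base optimizer step with the sharpness-aware gradient.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in perturbations.items():
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```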
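
The robustness ratios in the Dataset Splits row reduce to two divisions of robust accuracy by clean validation accuracy; the helper below (names are ours, not the paper's) just makes the definition explicit.

```python
def robustness_ratios(acc_val: float, acc_c: float, acc_v2: float) -> dict:
    """Robust accuracy (ImageNet-C, ImageNet-V2) relative to clean validation accuracy."""
    return {"R_C": acc_c / acc_val, "R_V2": acc_v2 / acc_val}
```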
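
The hyperparameters in the Experiment Setup row map onto a standard PyTorch optimizer/scheduler stack. The sketch below uses the quoted values; composing warmup and cosine decay with `SequentialLR`, and the placeholder model, are assumptions rather than the paper's released code.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Values quoted from the Experiment Setup row; schedule composition is an assumption.
EPOCHS, WARMUP_EPOCHS = 90, 8
PEAK_LR, MOMENTUM, WEIGHT_DECAY = 1.024, 0.875, 3.05e-05

model = torch.nn.Linear(10, 10)  # placeholder for the actual network

optimizer = torch.optim.SGD(
    model.parameters(), lr=PEAK_LR, momentum=MOMENTUM, weight_decay=WEIGHT_DECAY
)

# Linear warmup to the peak LR over the first 8 epochs, cosine annealing afterwards.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

for epoch in range(EPOCHS):
    # ... one training epoch over ImageNet-1K (batch size 128) would run here ...
    scheduler.step()
```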