SPDY: Accurate Pruning with Speedup Guarantees
Authors: Elias Frantar, Dan Alistarh
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across popular vision and language models show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios, and is compatible with most existing pruning approaches. |
| Researcher Affiliation | Collaboration | IST Austria; Neural Magic. |
| Pseudocode | Yes | Algorithm 1 We efficiently compute the optimal layer-wise sparsity profile with execution time at most T given S, e^s_ℓ, t^s_ℓ and assuming that time is discretized, using bottom-up dynamic programming. ... Algorithm 2 Collect layer-wise timings t^s_ℓ. ... Algorithm 3 Generate reconstruction database entries W^s_ℓ. ... Algorithm 4 SPDY search for optimal sensitivity values c. (A dynamic-programming sketch of Algorithm 1 follows the table.) |
| Open Source Code | Yes | We provide efficient implementations of our methods at https://github.com/IST-DASLab/spdy. |
| Open Datasets | Yes | For experiments on ImageNet (Deng et al., 2009) we follow (Hubara et al., 2021a), by defining the calibration set for AP, gAP and the profile search to contain exactly one randomly-selected training image per class. For other tasks, we select 1000 training samples at random for the calibration set. ... the YOLOv5 (Jocher, 2022) object detector, and the widely used BERT-base for question answering (Devlin et al., 2019) on the SQuAD dataset (Rajpurkar et al., 2016). |
| Dataset Splits | Yes | The actual quality of this profile is then determined by stitching together layers from the reconstruction database (see Section 3.4) and computing the loss of the composite model on a small calibration set. In practice, we use the same data for validation as for the AP; similar to (Hubara et al., 2021a), we do not observe any overfitting. ... For experiments on ImageNet (Deng et al., 2009) we follow (Hubara et al., 2021a), by defining the calibration set for AP, gAP and the profile search to contain exactly one randomly-selected training image per class. |
| Hardware Specification | Yes | This is executed on a single NVIDIA 3090 GPU, and can be significantly optimized. ... Layer-wise timings for the AMD system are collected on an Amazon AWS c5a.8xlarge machine with 16 cores, while for Intel CPUs we use a c5.9xlarge server with 18 cores. ... on an AWS c5.12xlarge instance. |
| Software Dependencies | Yes | We measure speedups and execute inference on the publicly-available DeepSparse v0.9.1 CPU inference engine (Neural Magic, 2021; Kurtz et al., 2020), which is competitive when executing dense models with the standard ONNX and OpenVINO runtimes... |
| Experiment Setup | Yes | In all our experiments, we use the same set of sparsity targets for each layer S = {0} ∪ {1 − (1 − 0.4) · δ^i \| i = 0, . . . , 40} with δ = ((1 − 0.99)/(1 − 0.4))^(1/40). ... For time discretization, we always use B = 10^4 buckets as individual units of time. ... The reconstruction database generation performs 10 epochs of optimization over this calibration set, using Adam (Kingma & Ba, 2015) with batchsize 32 and learning rate 10^−3 per sparsity level, while gAP runs for 100 epochs with learning rate 10^−5 and frozen batch norms. ... The listed speedups are for batchsize 64, except for BERT (Devlin et al., 2019), which uses batchsize 16. (The sparsity grid is reproduced in the second sketch below.) |
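
To make the quoted Algorithm 1 caption concrete, below is a minimal Python sketch of a bottom-up dynamic program of that shape: it assumes the per-layer errors e^s_ℓ and integer-bucket timings t^s_ℓ have already been collected, and picks one sparsity level per layer so that summed error is minimized under a total time budget. All names here (`optimal_profile`, `errors`, `times`, `budget`) are illustrative and not the authors' code; the official implementation is at https://github.com/IST-DASLab/spdy.

```python
import math

def optimal_profile(errors, times, budget):
    """Choose one sparsity level per layer, minimizing total error
    subject to total (bucketized) execution time <= budget.

    errors[l][s]: loss increase of layer l at sparsity level s
    times[l][s]:  integer time buckets of layer l at sparsity level s
    """
    num_layers = len(errors)
    # best[t]: minimal cumulative error reachable spending exactly t buckets
    best = [0.0] + [math.inf] * budget
    choice = [[None] * (budget + 1) for _ in range(num_layers)]

    for l in range(num_layers):
        new_best = [math.inf] * (budget + 1)
        for t in range(budget + 1):
            if math.isinf(best[t]):
                continue
            for s, (e, dt) in enumerate(zip(errors[l], times[l])):
                t2 = t + dt
                if t2 <= budget and best[t] + e < new_best[t2]:
                    new_best[t2] = best[t] + e
                    choice[l][t2] = (s, t)  # remember level and predecessor time
        best = new_best

    # Pick the cheapest feasible end state and backtrack the profile.
    t = min(range(budget + 1), key=lambda i: best[i])
    if math.isinf(best[t]):
        raise ValueError("no profile fits within the time budget")
    total_error, profile = best[t], []
    for l in reversed(range(num_layers)):
        s, t = choice[l][t]
        profile.append(s)
    return profile[::-1], total_error
```

The cost of this sketch is O(L · B · |S|) for L layers, B time buckets, and |S| sparsity levels, which is consistent with the paper's choice of discretizing time into B = 10^4 buckets to keep the search tractable.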
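Similarly, the sparsity-target set quoted in the Experiment Setup row can be reproduced in a few lines; this is a sketch of the formula as reconstructed above, with variable names of our choosing:

```python
# Dense option (0) plus 41 targets whose remaining density (1 - sparsity)
# decays geometrically from 0.6 down to 0.01, i.e. sparsity 0.40 -> 0.99.
delta = ((1 - 0.99) / (1 - 0.4)) ** (1 / 40)
targets = [0.0] + [1 - (1 - 0.4) * delta**i for i in range(41)]
assert abs(targets[1] - 0.40) < 1e-9 and abs(targets[-1] - 0.99) < 1e-9
print(len(targets), min(targets), max(targets))  # 42 levels in [0, 0.99]
```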