Balanced Sparsity for Efficient DNN Inference on GPU

Authors: Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie (pp. 5676-5683)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show that Balanced Sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retaining the same high model accuracy as fine-grained sparsity.
Researcher Affiliation | Collaboration | Zhuliang Yao (1,4), Shijie Cao (2,4), Wencong Xiao (3,4), Chen Zhang (4), Lanshun Nie (2); 1: Tsinghua University, 2: Harbin Institute of Technology, 3: Beihang University, 4: Microsoft Research Asia; {v-zhuyao, v-shicao, v-wencxi, zhac}@microsoft.com, nls@hit.edu.cn
Pseudocode | Yes | Algorithm 1: Balance-aware Iterative Pruning. Input: the matrix to be pruned, M; the number of blocks per row, BlockNum; the expected sparsity, Sparsity. Output: the pruned matrix, Mp.
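The row above only lists the inputs and output of Algorithm 1. Below is a minimal NumPy sketch of a single balance-aware pruning pass, not the authors' reference implementation: the function name balance_aware_prune and the one-shot per-block thresholding are illustrative assumptions (the paper's algorithm is iterative, gradually raising the sparsity target across pruning rounds).

    import numpy as np

    def balance_aware_prune(M, block_num, sparsity):
        # Split each row into `block_num` equal-sized blocks and zero out the
        # same fraction of smallest-magnitude weights inside every block, so
        # all blocks keep an identical number of non-zeros (the balanced property).
        Mp = M.copy()
        rows, cols = Mp.shape
        assert cols % block_num == 0, "row length must divide evenly into blocks"
        block_size = cols // block_num
        prune_per_block = int(round(block_size * sparsity))

        for r in range(rows):
            for b in range(block_num):
                block = Mp[r, b * block_size:(b + 1) * block_size]  # view into Mp
                smallest = np.argsort(np.abs(block))[:prune_per_block]
                block[smallest] = 0.0  # zeros written through the view
        return Mp

Because every block keeps exactly the same number of non-zeros, the non-zero weights of each row can be packed into equal-length segments, which is what makes the GPU kernel load-balanced.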
Open Source Code | Yes | Please refer to https://github.com/Howal/balanced-sparsity/blob/master/appendix-aaai19.pdf for proof.
Open Datasets | Yes | PTB dataset (Marcus et al. 1999), ImageNet ILSVRC-2012 dataset (Krizhevsky, Sutskever, and Hinton 2012), TIMIT dataset
Dataset Splits | Yes | VGG-16... dataset has 1.2M training examples and 50k validation examples.
Hardware Specification | No | The paper mentions experiments were run 'on GPU' and refers to 'GPU architecture' and 'GPU inference performance test', but does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100), CPU, or other hardware specifications.
Software Dependencies | No | The paper mentions using the 'cuBLAS library', 'cuSPARSE library', and an 'open sourced GPU library (Gray, Radford, and Kingma 2017)', but does not specify version numbers for these software components or any other software dependencies.
Experiment Setup | Yes | All the experiments in this section are done with a batch size of 1, the block number per row of our method is 32, and the block size of block sparsity is 8x8, unless explicitly stated.
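Building on the pruning sketch above, here is how the stated default of 32 blocks per row would be applied; the 256x1024 weight matrix and the 0.6 sparsity level are assumed example values, not settings reported in the paper.

    import numpy as np

    # Hypothetical 256x1024 weight matrix; block_num=32 follows the default
    # stated in the experiment setup, while sparsity=0.6 is an assumed value.
    W = np.random.randn(256, 1024).astype(np.float32)
    W_pruned = balance_aware_prune(W, block_num=32, sparsity=0.6)

    # Every 32-element block in every row now keeps the same number of zeros.
    zeros_per_block = (W_pruned.reshape(256, 32, 32) == 0).sum(axis=-1)
    print(zeros_per_block.min(), zeros_per_block.max())  # identical counts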