SAS: Structured Activation Sparsification

Authors: Yusuke Sekikawa, Shingo Yashima

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In extensive experiments, we demonstrate that increasing sparsity monotonically improves accuracy (up to 7% on CIFAR10) without increasing the mult count. Furthermore, we show that structured sparsification of activation scales better than that of weight given the same computational budget.
Researcher Affiliation | Industry | Yusuke Sekikawa, Shingo Yashima, DENSO IT Lab. Inc., Tokyo, Japan, {yusuke.sekikawa, yashima.shingo}@core.d-itlab.co.jp
Pseudocode | Yes | Listing 1: cuSAS: General SAS matmul library for Sparse Tensor Core. It is based on the CuPy (Okuta et al., 2017) wrapper of cuSPARSELt. The index needs to be reordered before execution of the sparse matmul (L8 of the listing). Refer to supp. E for more details about the reordering specific to NVIDIA GPUs. (A conceptual sketch of the SAS matmul is given below the table.)
Open Source Code | Yes | https://github.com/DensoITLab/sas_
Open Datasets | Yes | CIFAR-10 / CIFAR-100. We use ResNet18 (He et al., 2016) as one of the most popular architectures. For all the variants, we utilize our proposed ERAdam optimizer (section 2.4) in combination with the k-decay scheduler (Zhang & Li, 2020); ImageNet. We use ConvNeXt (Liu et al., 2022).
Dataset Splits | No | The paper does not provide specific percentages or sample counts for a validation split. While it uses standard datasets such as ImageNet, which have predefined validation sets, it does not explicitly state the split details for all experiments.
Hardware Specification | Yes | Figure 3: Speed benchmarking. SAS (1:2) vs. SWS (1:2) for general matmul on an NVIDIA A6000 GPU. ... We use four A100 GPUs (each holding 256 batches) with an update frequency of four to virtually construct the batch size of 4096. (That is, 4 GPUs × 256 samples × 4 accumulation steps = 4096.)
Software Dependencies | No | The paper mentions software such as CuPy, cuSPARSELt, cuBLAS, TensorFlow, PyTorch, and JAX, but it does not provide specific version numbers for these components.
Experiment Setup | Yes | Table A1: Experimental setup (CIFAR-10/CIFAR-100 vs. ImageNet) — Network: ResNet18 vs. ConvNeXt-B; Batch size: 512 vs. 4096; Training epochs: 16/α 1000 vs. 600; Optimizer: ERAdam (section 2.4) vs. AdamW; Scheduler: two-cycle cosine with kDecay=2.0 (Zhang & Li, 2020) vs. cosine; Initialization: Kaiming-uniform (He et al., 2015) vs. truncated Gaussian; Base width α: 4/8/16 vs. 2; Sparsity M: ReLU/2/4/8/16 vs. ReLU/2. (A configuration summary is given below the table.)
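
The Pseudocode row refers to Listing 1 (cuSAS), which executes the SAS matmul on Sparse Tensor Cores via a CuPy wrapper of cuSPARSELt. As a conceptual aid only, below is a minimal NumPy sketch of a 1:2 SAS matmul. The sign-based routing rule, the helper name `sas_expand_1of2`, and the dense `@` product are illustrative assumptions rather than the authors' kernel, which reorders indices and dispatches the product to cuSPARSELt (Listing 1, supp. E).

```python
# Minimal NumPy sketch of a 1:2 structured-sparse activation matmul (the SAS idea).
# Assumptions: sign-based routing into the two slots of each channel pair, and a
# plain dense "@" product standing in for the cuSPARSELt-backed sparse kernel.
import numpy as np


def sas_expand_1of2(a: np.ndarray) -> np.ndarray:
    """Expand dense activations (B, C) into a wider (B, 2C) activation where each
    consecutive pair of channels holds at most one nonzero (1:2 sparsity), so the
    following matmul needs no more effective multiplications than the narrow dense one."""
    pos = np.maximum(a, 0.0)   # slot 0 of each pair: positive part
    neg = np.maximum(-a, 0.0)  # slot 1 of each pair: negative part
    return np.stack([pos, neg], axis=-1).reshape(a.shape[0], -1)


rng = np.random.default_rng(0)
B, C, D = 4, 8, 16
a = rng.standard_normal((B, C)).astype(np.float32)            # narrow dense activation
W_wide = rng.standard_normal((2 * C, D)).astype(np.float32)   # wider weight matrix

a_sparse = sas_expand_1of2(a)                                  # (B, 2C), 1:2 structured sparse
assert ((a_sparse.reshape(B, C, 2) != 0).sum(axis=-1) <= 1).all()
y = a_sparse @ W_wide          # on GPU, cuSAS would run this product on Sparse Tensor Cores
print(y.shape)                 # (4, 16)
```

The assertion checks the 1:2 structure: each expanded channel pair carries at most one nonzero, which is why the effective multiplication count matches the narrow dense matmul.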
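
For quick reference, the Table A1 settings quoted in the Experiment Setup row can be collected into a plain configuration dictionary. This is only a restatement of the quoted values; the dictionary and its key names are ours, and ERAdam and the k-decay scheduler are the paper's components, referenced here by name only.

```python
# Table A1 of the paper restated as a plain dict; the key names are ours.
EXPERIMENT_SETUP = {
    "cifar10_cifar100": {
        "network": "ResNet18",
        "batch_size": 512,
        "training_epochs": "16/α 1000",  # as printed in Table A1
        "optimizer": "ERAdam (section 2.4)",
        "scheduler": "two-cycle cosine, kDecay=2.0 (Zhang & Li, 2020)",
        "initialization": "Kaiming-uniform (He et al., 2015)",
        "base_width_alpha": [4, 8, 16],
        "sparsity_M": ["ReLU", 2, 4, 8, 16],
    },
    "imagenet": {
        "network": "ConvNeXt-B",
        "batch_size": 4096,
        "training_epochs": 600,
        "optimizer": "AdamW",
        "scheduler": "cosine",
        "initialization": "truncated Gaussian",
        "base_width_alpha": 2,
        "sparsity_M": ["ReLU", 2],
    },
}
```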