Sparse Weight Activation Training
Authors: Md Aamir Raihan, Tor Aamodt
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets. For ResNet-50 on ImageNet, SWAT reduces total floating-point operations (FLOPs) during training by 80%, resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms, while incurring only a 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights. Code is available at https://github.com/AamirRaihan/SWAT. |
| Researcher Affiliation | Academia | Md Aamir Raihan, Tor M. Aamodt, Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC. {araihan,aamodt}@ece.ubc.ca |
| Pseudocode | Yes | Sparse weight activation training (SWAT) embodies these two strategies as follows (for pseudo-code see supplementary material). A sketch of the underlying top-K sparsification step appears after this table. |
| Open Source Code | Yes | Code is available at https://github.com/AamirRaihan/SWAT. |
| Open Datasets | Yes | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets. |
| Dataset Splits | Yes | For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. |
| Hardware Specification | Yes | Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. |
| Software Dependencies | Yes | We measure validation accuracy of SWAT by implementing custom convolution and linear layers in PyTorch 1.1.0 [48]. Inside each custom PyTorch layer we perform sparsification before performing the layer forward or backward pass computation. To obtain accuracy measurements in a reasonable time these custom layers invoke NVIDIA's cuDNN library using PyTorch's C++ interface. A sketch of such a custom layer appears after this table. |
| Experiment Setup | Yes | We use SGD with momentum as the optimization algorithm with an initial learning rate of 0.1, momentum of 0.9, and weight decay λ of 0.0005 [for the CIFAR runs]. For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. The learning rate schedule starts with a linear warm-up reaching its maximum of 0.1 at epoch 5 and is reduced by a factor of 10 at epochs 30, 60 and 80. The optimization method [for ImageNet] is SGD with Nesterov momentum of 0.9 and weight decay λ of 0.0001. A sketch of this schedule appears after this table. |
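The paper defers SWAT's pseudocode to its supplementary material. The sketch below is a minimal, hypothetical reconstruction of the core step both quoted strategies rely on: keeping only the top-K entries of a weight or activation tensor by magnitude. The function name `topk_sparsify` and the flat per-tensor ranking are assumptions for illustration, not the authors' code.

```python
import torch

def topk_sparsify(tensor: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero all but the largest-magnitude entries of `tensor`.

    `sparsity` is the fraction of entries dropped, e.g. 0.8 keeps the
    top 20% of weights (or activations) by absolute value.
    """
    if sparsity <= 0.0:
        return tensor
    k = max(1, int(tensor.numel() * (1.0 - sparsity)))  # number of survivors
    # The k-th largest magnitude becomes the keep/drop threshold.
    threshold = torch.topk(tensor.abs().flatten(), k).values[-1]
    return tensor * (tensor.abs() >= threshold)
```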
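The quoted implementation note (custom PyTorch layers that sparsify before the forward or backward computation) could be realized with a `torch.autograd.Function` along the following lines, reusing `topk_sparsify` from the previous sketch. This is a sketch under assumptions (fixed stride and padding, no bias, per-tensor top-K, illustrative class name), not the repository's actual layer.

```python
import torch
import torch.nn.functional as F

class SWATConv2d(torch.autograd.Function):
    """SWAT-style convolution sketch: the forward pass uses top-K
    sparsified weights; the backward pass uses sparsified weights and
    sparsified activations (stride=1, padding=1 fixed for brevity)."""

    @staticmethod
    def forward(ctx, x, weight, sparsity):
        sparse_w = topk_sparsify(weight, sparsity)  # sparse weights, dense activations
        # Saving the sparsified input is what shrinks the backward-pass
        # activation footprint the paper reports.
        ctx.save_for_backward(topk_sparsify(x, sparsity), sparse_w)
        return F.conv2d(x, sparse_w, stride=1, padding=1)

    @staticmethod
    def backward(ctx, grad_out):
        sparse_x, sparse_w = ctx.saved_tensors
        # Input gradient from sparse weights; weight gradient from
        # sparse activations. The output gradient itself stays dense.
        grad_x = torch.nn.grad.conv2d_input(sparse_x.shape, sparse_w,
                                            grad_out, padding=1)
        grad_w = torch.nn.grad.conv2d_weight(sparse_x, sparse_w.shape,
                                             grad_out, padding=1)
        return grad_x, grad_w, None  # no gradient for `sparsity`
```

A custom module's forward would then call `SWATConv2d.apply(x, self.weight, 0.8)` in place of a plain `F.conv2d`.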
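For the quoted ImageNet recipe (linear warm-up to 0.1 at epoch 5, 10× decay at epochs 30, 60 and 80, Nesterov SGD with weight decay 1e-4), the schedule could be expressed with a standard `LambdaLR` as below. The stand-in model and empty loop body are placeholders, not the paper's training code.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual network (e.g. ResNet-50)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)

def lr_factor(epoch: int) -> float:
    if epoch < 5:                    # linear warm-up to the full base LR
        return (epoch + 1) / 5
    factor = 1.0
    for milestone in (30, 60, 80):   # divide the LR by 10 at each milestone
        if epoch >= milestone:
            factor *= 0.1
    return factor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for epoch in range(90):              # 90 epochs, batch size 256 per the paper
    pass                             # one training epoch would go here
    scheduler.step()
```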