Sparse Weight Activation Training

Authors: Md Aamir Raihan, Tor Aamodt

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets. For ResNet-50 on ImageNet SWAT reduces total floating-point operations (FLOPs) during training by 80%, resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms while incurring only 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights. Code is available at https://github.com/AamirRaihan/SWAT.
Researcher Affiliation | Academia | Md Aamir Raihan, Tor M. Aamodt, Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC {araihan,aamodt}@ece.ubc.ca
Pseudocode | Yes | Sparse weight activation training (SWAT) embodies these two strategies as follows (for pseudo-code see supplementary material).
Open Source Code | Yes | Code is available at https://github.com/AamirRaihan/SWAT.
Open Datasets | Yes | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets.
Dataset Splits | Yes | For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs.
Hardware Specification | Yes | Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs.
Software Dependencies | Yes | We measure validation accuracy of SWAT by implementing custom convolution and linear layers in PyTorch 1.1.0 [48]. Inside each custom PyTorch layer we perform sparsification before performing the layer forward or backward pass computation. To obtain accuracy measurements in a reasonable time these custom layers invoke NVIDIA's cuDNN library using PyTorch's C++ interface.
Experiment Setup | Yes | We use SGD with momentum as the optimization algorithm with an initial learning rate of 0.1, momentum of 0.9, and weight decay λ of 0.0005. For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. The learning rate schedule starts with a linear warm-up reaching its maximum of 0.1 at epoch 5 and is reduced by a factor of 10 at epochs 30, 60 and 80. The optimization method is SGD with Nesterov momentum of 0.9 and weight decay λ of 0.0001.
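
A minimal sketch of the custom sparsified layers described in the Pseudocode and Software Dependencies rows, assuming top-K magnitude sparsification of the weights in the forward pass and of both the weights and the saved activations used in the backward pass. Helper names such as topk_mask and SWATConv2dFunction are illustrative, not identifiers from the released code.

```python
import torch
import torch.nn.functional as F


def topk_mask(x, sparsity):
    """Keep the largest-magnitude fraction (1 - sparsity) of entries of x."""
    k = max(1, int(x.numel() * (1.0 - sparsity)))
    threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return (x.abs() >= threshold).to(x.dtype)


class SWATConv2dFunction(torch.autograd.Function):
    """Forward: dense input convolved with top-K sparsified weights.
    Backward: gradients computed from the sparsified weights and the
    sparsified saved activations (the source of the memory savings)."""

    @staticmethod
    def forward(ctx, inp, weight, sparsity, stride, padding):
        sparse_w = weight * topk_mask(weight, sparsity)
        sparse_in = inp * topk_mask(inp, sparsity)   # only the sparse copy is saved
        ctx.save_for_backward(sparse_in, sparse_w)
        ctx.conv_args = (stride, padding)
        return F.conv2d(inp, sparse_w, stride=stride, padding=padding)

    @staticmethod
    def backward(ctx, grad_out):
        sparse_in, sparse_w = ctx.saved_tensors
        stride, padding = ctx.conv_args
        grad_in = torch.nn.grad.conv2d_input(
            sparse_in.shape, sparse_w, grad_out, stride=stride, padding=padding)
        grad_w = torch.nn.grad.conv2d_weight(
            sparse_in, sparse_w.shape, grad_out, stride=stride, padding=padding)
        return grad_in, grad_w, None, None, None
```

A custom convolution module would wrap this by calling SWATConv2dFunction.apply(x, self.weight, sparsity, stride, padding) in its forward method, mirroring the paper's approach of sparsifying inside each custom PyTorch layer before the layer computation.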
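A possible PyTorch data pipeline matching the augmentation quoted under Dataset Splits (224×224 random crops, horizontal flips, batch size 256). RandomResizedCrop is the usual PyTorch stand-in for the random-crop augmentation; the normalization constants and dataset path are standard-practice assumptions, not details from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transform = T.Compose([
    T.RandomResizedCrop(224),      # 224x224 random crops
    T.RandomHorizontalFlip(),      # horizontal flip
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = ImageFolder("/path/to/imagenet/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                          num_workers=16, pin_memory=True)
```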
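A sketch of the ImageNet optimizer and learning-rate schedule quoted under Experiment Setup: SGD with Nesterov momentum 0.9 and weight decay 1e-4, linear warm-up to 0.1 by epoch 5, tenfold decay at epochs 30, 60 and 80, label smoothing of 0.1, and 90 epochs. The built-in label_smoothing argument assumes a newer PyTorch than the 1.1.0 used in the paper, and the ResNet-50 model is a stand-in for whichever evaluated network is being trained.

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # stand-in for the evaluated network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # requires PyTorch >= 1.10


def learning_rate(epoch, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up followed by a 10x step decay at epochs 30, 60 and 80."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * 0.1 ** sum(epoch >= e for e in (30, 60, 80))


for epoch in range(90):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... one epoch of forward/backward over train_loader using criterion ...
```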