Sparse Weight Activation Training

Authors: Md Aamir Raihan, Tor Aamodt

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets. For ResNet-50 on ImageNet SWAT reduces total floating-point operations (FLOPs) during training by 80%, resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms while incurring only 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights. Code is available at https://github.com/AamirRaihan/SWAT.
Researcher Affiliation | Academia | Md Aamir Raihan, Tor M. Aamodt, Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC {araihan,aamodt}@ece.ubc.ca
Pseudocode | Yes | Sparse weight activation training (SWAT) embodies these two strategies as follows (for pseudo-code see supplementary material).
Open Source Code | Yes | Code is available at https://github.com/AamirRaihan/SWAT.
Open Datasets | Yes | We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using CIFAR-10, CIFAR-100 and ImageNet datasets.
Dataset Splits | Yes | For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs.
Hardware Specification | Yes | Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs.
Software Dependencies | Yes | We measure validation accuracy of SWAT by implementing custom convolution and linear layers in PyTorch 1.1.0 [48]. Inside each custom PyTorch layer we perform sparsification before performing the layer forward or backward pass computation. To obtain accuracy measurements in a reasonable time these custom layers invoke NVIDIA's cuDNN library using PyTorch's C++ interface.
Experiment Setup | Yes | We use SGD with momentum as the optimization algorithm with an initial learning rate of 0.1, momentum of 0.9, and weight decay λ of 0.0005. For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flip are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. The learning rate schedule starts with a linear warm-up reaching its maximum of 0.1 at epoch 5 and is reduced by a factor of 10 at epochs 30, 60 and 80. The optimization method is SGD with Nesterov momentum of 0.9 and weight decay λ of 0.0001.
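
A minimal sketch of the custom sparsified layers described in the Pseudocode and Software Dependencies rows, assuming top-K magnitude sparsification of the weights in the forward pass and of both the weights and the saved activations used in the backward pass. Helper names such as topk_mask and SWATConv2dFunction are illustrative, not identifiers from the released code.

```python
import torch
import torch.nn.functional as F


def topk_mask(x, sparsity):
    """Keep the largest-magnitude fraction (1 - sparsity) of entries of x."""
    k = max(1, int(x.numel() * (1.0 - sparsity)))
    threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return (x.abs() >= threshold).to(x.dtype)


class SWATConv2dFunction(torch.autograd.Function):
    """Forward: dense input convolved with top-K sparsified weights.
    Backward: gradients computed from the sparsified weights and the
    sparsified saved activations (the source of the memory savings)."""

    @staticmethod
    def forward(ctx, inp, weight, sparsity, stride, padding):
        sparse_w = weight * topk_mask(weight, sparsity)
        sparse_in = inp * topk_mask(inp, sparsity)   # only the sparse copy is saved
        ctx.save_for_backward(sparse_in, sparse_w)
        ctx.conv_args = (stride, padding)
        return F.conv2d(inp, sparse_w, stride=stride, padding=padding)

    @staticmethod
    def backward(ctx, grad_out):
        sparse_in, sparse_w = ctx.saved_tensors
        stride, padding = ctx.conv_args
        grad_in = torch.nn.grad.conv2d_input(
            sparse_in.shape, sparse_w, grad_out, stride=stride, padding=padding)
        grad_w = torch.nn.grad.conv2d_weight(
            sparse_in, sparse_w.shape, grad_out, stride=stride, padding=padding)
        return grad_in, grad_w, None, None, None
```

A custom convolution module would wrap this by calling SWATConv2dFunction.apply(x, self.weight, sparsity, stride, padding) in its forward method, mirroring the paper's approach of sparsifying inside each custom PyTorch layer before the layer computation.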
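A possible PyTorch data pipeline matching the augmentation quoted under Dataset Splits (224×224 random crops, horizontal flips, batch size 256). RandomResizedCrop is the usual PyTorch stand-in for the random-crop augmentation; the normalization constants and dataset path are standard-practice assumptions, not details from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transform = T.Compose([
    T.RandomResizedCrop(224),      # 224x224 random crops
    T.RandomHorizontalFlip(),      # horizontal flip
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = ImageFolder("/path/to/imagenet/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                          num_workers=16, pin_memory=True)
```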
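A sketch of the ImageNet optimizer and learning-rate schedule quoted under Experiment Setup: SGD with Nesterov momentum 0.9 and weight decay 1e-4, linear warm-up to 0.1 by epoch 5, tenfold decay at epochs 30, 60 and 80, label smoothing of 0.1, and 90 epochs. The built-in label_smoothing argument assumes a newer PyTorch than the 1.1.0 used in the paper, and the ResNet-50 model is a stand-in for whichever evaluated network is being trained.

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # stand-in for the evaluated network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # requires PyTorch >= 1.10


def learning_rate(epoch, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up followed by a 10x step decay at epochs 30, 60 and 80."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * 0.1 ** sum(epoch >= e for e in (30, 60, 80))


for epoch in range(90):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... one epoch of forward/backward over train_loader using criterion ...
```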