AC-GC: Lossy Activation Compression with Guaranteed Convergence

Authors: R. David Evans, Tor M. Aamodt

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine activation compression by modifying the Chainer framework [53] to compress and decompress activations during training. We measure compression rates every 100 iterations, and otherwise perform paired compression/decompression to maintain the highest performance for our experiments. We focus our analysis on CNNs with image and text datasets, as they have large activation memory requirements, but avoid the largest networks [22, 52] due to limited resources. We create a performance implementation based on Chen et al. [7] to measure throughput. For ImageNet [11], CIFAR10 [2], and Div2K [1], we use SGD with 0.9 momentum for VGG16 [50], ResNets (RN18 and RN50) [20], Wide ResNet (WRN) [59], and VDSR [29]. IMDB [39] and Text Copy [4] are trained using ADAM with CNN [53], RNN [53], and transformer heads [54].
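The paired compression/decompression of stored activations described above can be pictured with a small NumPy sketch. This is an illustrative uniform-quantization example only, not the authors' actual compression scheme or API; the `compress`/`decompress` names and the 8-bit setting are assumptions.

```python
import numpy as np

def compress(act, bits=8):
    """Uniform fixed-point quantization of an activation tensor (illustrative only)."""
    lo, hi = float(act.min()), float(act.max())
    scale = max((hi - lo) / (2 ** bits - 1), 1e-12)
    q = np.round((act - lo) / scale).astype(np.uint16)
    return q, lo, scale

def decompress(q, lo, scale):
    """Reconstruct an approximate activation from its quantized form."""
    return q.astype(np.float32) * scale + lo

# Paired compression/decompression: store the compressed activation in the
# forward pass, reconstruct it before the backward pass uses it.
act = np.random.randn(64, 128).astype(np.float32)
q, lo, scale = compress(act, bits=8)
act_hat = decompress(q, lo, scale)
print("max abs reconstruction error:", np.abs(act - act_hat).max())
```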
Researcher Affiliation | Academia | R. David Evans, Dept. of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, rdevans@ece.ubc.ca; Tor M. Aamodt, Dept. of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, aamodt@ece.ubc.ca
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code is available at https://github.com/rdevans0/acgc.
Open Datasets | Yes | For ImageNet [11], CIFAR10 [2], and Div2K [1], we use SGD with 0.9 momentum for VGG16 [50], ResNets (RN18 and RN50) [20], Wide ResNet (WRN) [59], and VDSR [29]. IMDB [39] and Text Copy [4] are trained using ADAM with CNN [53], RNN [53], and transformer heads [54]. All image datasets are augmented with random sizing, flip, and crop, as well as whitening and PCA for ImageNet [30], and 8×8 cutout for CIFAR10 [12]. Learning rates, batch sizes, and epochs are 0.05, 128, 300 (CIFAR10, [49]); 0.1, 64, 105 (ImageNet, [58]); 0.1, 32, 110 (Div2K, grid search); 2.0, 64, 100 (Text Copy, [4]); and 0.001, 64, 20 (IMDB, [53]).
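For readability, the quoted per-dataset hyperparameters can be collected into one structure. The dictionary below is a hypothetical summary; its layout and key names are illustrative and not taken from the released code.

```python
# Hypothetical summary of the quoted training configurations.
TRAIN_CONFIGS = {
    # dataset:  optimizer,                 learning rate, batch size, epochs
    "CIFAR10":  {"opt": "SGD (momentum 0.9)", "lr": 0.05,  "batch": 128, "epochs": 300},
    "ImageNet": {"opt": "SGD (momentum 0.9)", "lr": 0.1,   "batch": 64,  "epochs": 105},
    "Div2K":    {"opt": "SGD (momentum 0.9)", "lr": 0.1,   "batch": 32,  "epochs": 110},
    "TextCopy": {"opt": "ADAM",               "lr": 2.0,   "batch": 64,  "epochs": 100},
    "IMDB":     {"opt": "ADAM",               "lr": 0.001, "batch": 64,  "epochs": 20},
}

for name, cfg in TRAIN_CONFIGS.items():
    print(f"{name}: {cfg['opt']}, lr={cfg['lr']}, batch={cfg['batch']}, epochs={cfg['epochs']}")
```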
Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, validation, and test splits. While standard datasets were used, the specific split information is not stated in the text.
Hardware Specification | Yes | Table 2: Trained using 900 GPU-days (RTX 2080 Ti).
Software Dependencies | No | The paper mentions modifying the Chainer framework [53] but does not specify a version number for Chainer or any other ancillary software.
Experiment Setup | Yes | Learning rates, batch sizes, and epochs are 0.05, 128, 300 (CIFAR10, [49]); 0.1, 64, 105 (ImageNet, [58]); 0.1, 32, 110 (Div2K, grid search); 2.0, 64, 100 (Text Copy, [4]); and 0.001, 64, 20 (IMDB, [53]). Unless otherwise stated, all experiments use e2 = 0.5, parameter estimates from the mean of a ten entry window, and a recalculation interval of 100 iterations.
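The recalculation schedule quoted above (ten-entry window, 100-iteration interval, fixed error target) can be sketched as follows. Only the windowing and scheduling logic reflect the quoted setup; the `estimate_bits` helper is a placeholder assumption and does not reproduce the paper's convergence-preserving error-bound derivation.

```python
from collections import deque

ERROR_TARGET = 0.5      # the quoted e2 = 0.5 setting
WINDOW_SIZE = 10        # parameter estimates use the mean of a ten-entry window
RECALC_INTERVAL = 100   # compression parameters recalculated every 100 iterations

def estimate_bits(window_mean, error_target):
    """Placeholder mapping from a smoothed statistic to a bit width; the real
    method derives bit widths from its analytical error bound."""
    return max(2, int(round(16 - window_mean / error_target)))

history = deque(maxlen=WINDOW_SIZE)
bits = 8  # initial bit width (assumed)

for iteration in range(1, 501):
    per_iter_stat = 4.0  # stand-in for the quantity measured each iteration
    history.append(per_iter_stat)
    if iteration % RECALC_INTERVAL == 0:
        window_mean = sum(history) / len(history)
        bits = estimate_bits(window_mean, ERROR_TARGET)
        print(f"iteration {iteration}: recalculated bit width = {bits}")
```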