AC-GC: Lossy Activation Compression with Guaranteed Convergence
Authors: R. David Evans, Tor M. Aamodt
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine activation compression by modifying the Chainer framework [53] to compress and decompress activations during training. We measure compression rates every 100 iterations, and otherwise perform paired compression/decompression to maintain the highest performance for our experiments. We focus our analysis on CNNs with image and text datasets, as they have large activation memory requirements, but avoid the largest networks [22, 52] due to limited resources. We create a performance implementation based on Chen et al. [7] to measure throughput. For ImageNet [11], CIFAR10 [2], and Div2K [1], we use SGD with 0.9 momentum for VGG16 [50], ResNets (RN18 and RN50) [20], Wide ResNet (WRN) [59], and VDSR [29]. IMDB [39] and Text Copy [4] are trained using ADAM with CNN [53], RNN [53], and transformer heads [54]. (A minimal sketch of this paired compression/decompression pattern appears after the table.) |
| Researcher Affiliation | Academia | R. David Evans, Dept. of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, rdevans@ece.ubc.ca; Tor M. Aamodt, Dept. of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, aamodt@ece.ubc.ca |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/rdevans0/acgc. |
| Open Datasets | Yes | For ImageNet [11], CIFAR10 [2], and Div2K [1], we use SGD with 0.9 momentum for VGG16 [50], ResNets (RN18 and RN50) [20], Wide ResNet (WRN) [59], and VDSR [29]. IMDB [39] and Text Copy [4] are trained using ADAM with CNN [53], RNN [53], and transformer heads [54]. All image datasets are augmented with random sizing, flip, and crop, as well as whitening and PCA for ImageNet [30], and 8×8 cutout for CIFAR10 [12]. Learning rates, batch sizes, and epochs are 0.05, 128, 300 (CIFAR10, [49]); 0.1, 64, 105 (ImageNet, [58]); 0.1, 32, 110 (Div2K, grid search); 2.0, 64, 100 (Text Copy, [4]); and 0.001, 64, 20 (IMDB, [53]). |
| Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, test, and validation dataset splits. While standard datasets were used, the specific split information is not stated in the text. |
| Hardware Specification | Yes | Table 2: Trained using 900 GPU-days (RTX 2080 Ti). |
| Software Dependencies | No | The paper mentions modifying the 'Chainer framework [53]' but does not specify a version number for Chainer or any other ancillary software. |
| Experiment Setup | Yes | Learning rates, batch sizes, and epochs are 0.05, 128, 300 (CIFAR10, [49]); 0.1, 64, 105 (ImageNet, [58]); 0.1, 32, 110 (Div2K, grid search); 2.0, 64, 100 (Text Copy, [4]); and 0.001, 64, 20 (IMDB, [53]). Unless otherwise stated, all experiments use ε² = 0.5, parameter estimates from the mean of a ten-entry window, and a recalculation interval of 100 iterations. (These settings are collected in the configuration summary after the table.) |
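
The Research Type row describes paired compression and decompression of stashed activations during training, with compression rates measured every 100 iterations. Below is a minimal sketch of that general pattern, under stated assumptions: the uniform 8-bit quantizer and the `CompressedStash` helper are illustrative placeholders, not the AC-GC error-bounded quantizer or the paper's actual Chainer modifications.

```python
import numpy as np

def compress(act, bits=8):
    """Uniformly quantize an activation tensor to `bits` bits (placeholder quantizer)."""
    lo, hi = float(act.min()), float(act.max())
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    q = np.round((act - lo) / scale).astype(np.uint8)
    return q, lo, scale

def decompress(q, lo, scale):
    """Reconstruct an approximate activation tensor from its quantized form."""
    return q.astype(np.float32) * scale + lo

class CompressedStash:
    """Stores activations compressed in the forward pass and restores them
    for the backward pass (paired compression/decompression)."""
    def __init__(self):
        self.store = {}

    def save(self, key, act):
        self.store[key] = compress(act)

    def load(self, key):
        q, lo, scale = self.store.pop(key)
        return decompress(q, lo, scale)

# Usage: stash an activation in the forward pass, restore it for backward.
stash = CompressedStash()
act = np.random.randn(64, 128).astype(np.float32)  # stand-in forward-pass activation
stash.save("layer3", act)
act_hat = stash.load("layer3")                     # approximate activation for backprop
```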
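
For convenience, the hyperparameters quoted in the Open Datasets and Experiment Setup rows can be gathered into a single reference block. This is only a summary of the values quoted above; the names `TRAIN_CONFIG` and `ACGC_DEFAULTS` are assumptions for illustration and do not come from the paper or its released code.

```python
# Per-dataset training configuration quoted from the paper's experiment setup.
TRAIN_CONFIG = {
    "CIFAR10":   {"optimizer": "SGD (momentum 0.9)", "lr": 0.05,  "batch_size": 128, "epochs": 300},
    "ImageNet":  {"optimizer": "SGD (momentum 0.9)", "lr": 0.1,   "batch_size": 64,  "epochs": 105},
    "Div2K":     {"optimizer": "SGD (momentum 0.9)", "lr": 0.1,   "batch_size": 32,  "epochs": 110},
    "Text Copy": {"optimizer": "ADAM",               "lr": 2.0,   "batch_size": 64,  "epochs": 100},
    "IMDB":      {"optimizer": "ADAM",               "lr": 0.001, "batch_size": 64,  "epochs": 20},
}

# AC-GC-specific defaults reported in the paper (unless otherwise stated).
ACGC_DEFAULTS = {
    "error_bound_eps_sq": 0.5,    # ε² = 0.5
    "param_estimate_window": 10,  # parameter estimates from a ten-entry window mean
    "recalc_interval_iters": 100, # recalculation interval of 100 iterations
}
```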