Backprop with Approximate Activations for Memory-efficient Network Training

Authors: Ayan Chakrabarti, Benjamin Moseley

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training, while storing activations compactly with as low as 4-bit precision.
Researcher Affiliation | Academia | Ayan Chakrabarti, Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130, ayan@wustl.edu; Benjamin Moseley, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, moseleyb@andrew.cmu.edu
Pseudocode | No | The paper describes the computational steps using mathematical equations and descriptive text, but it does not include any pseudocode or algorithm blocks. (A minimal illustrative sketch of the idea is given after the table.)
Open Source Code | Yes | Our reference implementation is available at http://projects.ayanc.org/blpa/.
Open Datasets | Yes | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training... We begin with comparisons on 164-layer pre-activation residual networks [9] on CIFAR-10 and CIFAR-100 [13]... For ImageNet [18], we train models with a 152-layer residual architecture...
Dataset Splits | Yes | For ImageNet, ... Table 1 reports top-5 validation accuracy (using 10 crops at a scale of 256) for models trained using exact computation, and our approach with K = 8 and K = 4 bit approximations. (A sketch of this ten-crop evaluation protocol is given after the table.)
Hardware Specification | Yes | For the CIFAR experiments, we were able to fit the full 128-size batch on a single 1080Ti GPU... caused an out-of-memory error on a 1080Ti GPU.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We train the network for 64k iterations with a batch size of 128, momentum of 0.9, and weight decay of 2e-4. Following [9], the learning rate is set to 1e-2 for the first 400 iterations, then increased to 1e-1, and dropped by a factor of 10 at 32k and 48k iterations. We use standard data-augmentation with random translation and horizontal flips. (A sketch of this schedule is given after the table.)
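As noted in the Pseudocode row, the paper presents its procedure through equations rather than pseudocode. The following is only a minimal PyTorch sketch of the underlying idea reported in the abstract (keep a low-precision copy of each activation for the backward pass while the forward pass stays exact). The quantize/dequantize helpers and the ApproxLinear function are our own illustrative names and simplifications, not the authors' scheme; their reference implementation is at http://projects.ayanc.org/blpa/.

```python
import torch

def quantize(x, k):
    # Uniform k-bit quantization of x over its observed range (a simplification).
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** k - 1) + 1e-12
    q = torch.round((x - lo) / scale).to(torch.uint8)   # values in [0, 2^k - 1] fit in uint8
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.float() * scale + lo

class ApproxLinear(torch.autograd.Function):
    """y = x @ W^T, storing only a k-bit copy of the input x for the backward pass."""

    @staticmethod
    def forward(ctx, x, weight, k):
        q, lo, scale = quantize(x, k)
        ctx.save_for_backward(q, lo, scale, weight)      # compact storage instead of full-precision x
        return x @ weight.t()                            # forward output stays exact

    @staticmethod
    def backward(ctx, grad_out):
        q, lo, scale, weight = ctx.saved_tensors
        x_approx = dequantize(q, lo, scale)              # reconstruct the stored activation
        grad_x = grad_out @ weight                       # gradient w.r.t. the input
        grad_w = grad_out.t() @ x_approx                 # weight gradient uses the approximate x
        return grad_x, grad_w, None

# Usage: y = ApproxLinear.apply(x, weight, 4) for K = 4 bit activation storage.
```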
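The Dataset Splits row quotes a ten-crop validation protocol. Below is a hedged sketch of the conventional ten-crop top-5 evaluation in torchvision; the 224-pixel crop size and the averaging of logits over crops are assumptions about the usual protocol, since the quoted text only states "10 crops at a scale of 256". The `loader` is assumed to yield ImageNet validation batches transformed with `eval_tf`.

```python
import torch
from torchvision import transforms

# Resize the short side to 256, then take the standard ten 224x224 crops per image.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

@torch.no_grad()
def top5_accuracy(model, loader):
    model.eval()
    correct = total = 0
    for images, labels in loader:                 # images: (B, 10, 3, 224, 224)
        b, ncrops, c, h, w = images.shape
        logits = model(images.view(-1, c, h, w))
        logits = logits.view(b, ncrops, -1).mean(dim=1)   # average logits over the ten crops
        top5 = logits.topk(5, dim=1).indices
        correct += (top5 == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += b
    return correct / total
```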
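Read directly from the Experiment Setup row, here is a hedged sketch of the CIFAR-10 training schedule in PyTorch. The 4-pixel-padded random crop is our assumption for "random translation", and resnet18 is only a stand-in for the paper's 164-layer pre-activation ResNet, which is not reproduced here.

```python
import torch
from torch import nn, optim
import torchvision
from torchvision import datasets, transforms

# Assumed standard CIFAR augmentation: padded random crop + horizontal flip.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Stand-in model; the paper uses a 164-layer pre-activation residual network.
model = torchvision.models.resnet18(num_classes=10)
opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=2e-4)
loss_fn = nn.CrossEntropyLoss()

def lr_at(step):
    # 1e-2 for the first 400 iterations, then 1e-1, dropped by 10x at 32k and 48k.
    if step < 400:
        return 1e-2
    if step < 32_000:
        return 1e-1
    if step < 48_000:
        return 1e-2
    return 1e-3

step = 0
while step < 64_000:
    for x, y in loader:
        for group in opt.param_groups:
            group["lr"] = lr_at(step)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        step += 1
        if step == 64_000:
            break
```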