Backprop with Approximate Activations for Memory-efficient Network Training
Authors: Ayan Chakrabarti, Benjamin Moseley
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training, while storing activations compactly with as low as 4-bit precision. (A minimal quantization sketch follows the table.) |
| Researcher Affiliation | Academia | Ayan Chakrabarti, Washington University in St. Louis, 1 Brookings Dr., St. Louis, MO 63130, ayan@wustl.edu; Benjamin Moseley, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, moseleyb@andrew.cmu.edu |
| Pseudocode | No | The paper describes the computational steps using mathematical equations and descriptive text, but it does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our reference implementation is available at http://projects.ayanc.org/blpa/. |
| Open Datasets | Yes | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training... We begin with comparisons on 164-layer pre-activation residual networks [9] on CIFAR-10 and CIFAR-100 [13]... For ImageNet [18], we train models with a 152-layer residual architecture... |
| Dataset Splits | Yes | For ImageNet, ... Table 1 reports top-5 validation accuracy (using 10 crops at a scale of 256) for models trained using exact computation, and our approach with K = 8 and K = 4 bit approximations. (A 10-crop evaluation sketch follows the table.) |
| Hardware Specification | Yes | For the CIFAR experiments, we were able to fit the full 128-size batch on a single 1080Ti GPU... caused an out-of-memory error on a 1080Ti GPU. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We train the network for 64k iterations with a batch size of 128, momentum of 0.9, and weight decay of 2e-4. Following [9], the learning rate is set to 1e-2 for the first 400 iterations, then increased to 1e-1, and dropped by a factor of 10 at 32k and 48k iterations. We use standard data-augmentation with random translation and horizontal flips. (A training-loop sketch follows the table.) |
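
The rows above summarize a method whose central idea is that the forward pass uses exact activations while only a low-precision (e.g. 4-bit) copy is retained for backprop. Below is a minimal PyTorch sketch of that idea for a single ReLU layer; it is not the authors' reference implementation (linked in the Open Source Code row), and the class name, the uniform per-tensor quantizer, and the uint8 storage are illustrative assumptions.

```python
import torch

class ReLUWithQuantizedSave(torch.autograd.Function):
    """ReLU whose backward pass relies only on a K-bit copy of its output."""

    @staticmethod
    def forward(ctx, x, num_bits=4):
        y = x.clamp(min=0)                           # exact ReLU output, passed on unchanged
        levels = 2 ** num_bits - 1
        scale = y.max().clamp(min=1e-8) / levels     # uniform per-tensor step size (an assumption)
        q = torch.round(y / scale).to(torch.uint8)   # K-bit code, held in a uint8 buffer here
        ctx.save_for_backward(q)
        ctx.scale = scale
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (q,) = ctx.saved_tensors
        y_approx = q.to(grad_out.dtype) * ctx.scale               # dequantized approximation
        grad_in = grad_out * (y_approx > 0).to(grad_out.dtype)    # ReLU gradient from the approximation
        return grad_in, None                                      # no gradient for num_bits
```

Calling `ReLUWithQuantizedSave.apply(x, 4)` in place of `torch.relu(x)` keeps the forward computation exact but stores the activation as uint8 codes (a 4x saving over float32; packing two 4-bit codes per byte would reach the 8x implied by 4-bit precision).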
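
For the Dataset Splits row, the following is a hedged sketch of 10-crop, top-5 validation as described in the quote, assuming a torchvision-style ImageNet folder; the directory path is a placeholder, and the untrained `resnet152` merely stands in for a model trained with the paper's method.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Scale 256, then ten 224x224 crops (4 corners + center, each also mirrored).
val_tf = T.Compose([
    T.Resize(256),
    T.TenCrop(224),
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])
# Placeholder path: point this at an ImageNet validation set in ImageFolder layout.
val_set = torchvision.datasets.ImageFolder("/path/to/imagenet/val", transform=val_tf)
loader = DataLoader(val_set, batch_size=8)

model = torchvision.models.resnet152()   # untrained stand-in for the paper's 152-layer model
model.eval()

correct = total = 0
with torch.no_grad():
    for crops, labels in loader:                      # crops: (B, 10, 3, 224, 224)
        b, n, c, h, w = crops.shape
        logits = model(crops.view(-1, c, h, w)).view(b, n, -1).mean(dim=1)  # average over crops
        top5 = logits.topk(5, dim=1).indices
        correct += (top5 == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += b
print("top-5 accuracy:", correct / total)
```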
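
The Experiment Setup row quotes concrete CIFAR hyperparameters; the sketch below wires them into a plain PyTorch training loop as an illustration only. The `resnet18` stand-in (torchvision does not ship the paper's 164-layer pre-activation ResNet), the missing input normalization, and the data path are assumptions.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Data augmentation quoted in the setup: random translation (padded crop) + horizontal flips.
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)   # placeholder architecture
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=2e-4)

def lr_at(step):
    # 1e-2 for the first 400 iterations, then 1e-1, divided by 10 at 32k and 48k.
    if step < 400:
        return 1e-2
    if step < 32_000:
        return 1e-1
    if step < 48_000:
        return 1e-2
    return 1e-3

step, total_steps = 0, 64_000
while step < total_steps:
    for x, y in loader:
        for g in opt.param_groups:
            g["lr"] = lr_at(step)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
        if step >= total_steps:
            break
```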