8-Bit Approximations for Parallelism in Deep Learning
Authors: Tim Dettmers
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations. We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. We build a predictive model for speedups based on our experimental data, verify its validity on known speedup data, and show that we can obtain a speedup of 50x and more on a system of 96 GPUs compared to a speedup of 23x for 32-bit. (An illustrative quantization sketch appears below the table.) |
| Researcher Affiliation | Academia | Tim Dettmers, Faculty of Informatics, Università della Svizzera italiana, Via Giuseppe Buffi 13, CH-6904 Lugano, Switzerland, tim.dettmers@gmail.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementations of our 8-bit approximation algorithms are available online at https://github.com/TimDettmers/clusterNet/; contact me if you need help with integrating the functions into your library. |
| Open Datasets | Yes | We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST, CIFAR10, and ImageNet for testing, but it does not explicitly provide the specific percentages or sample counts for training, validation, and test splits within the text. |
| Hardware Specification | Yes | On average, these algorithms perform compression and decompression in 1 and 0.5 nanoseconds per number, respectively, as measured on an NVIDIA GTX Titan. The algorithms were run on two NVIDIA GTX Titans. We also used a different GPU (GTX Titan X). |
| Software Dependencies | Yes | The standard size of gradients in deep learning is currently 32-bit, which is the smallest practical dimension for floating point numbers on GPUs, as CUDA only supports 32 and 64-bit floating point arithmetic. We used the message passing interface (MPI) implementation provided by Open MPI 1.8.5, which uses low level CUDA routines to enable GPU-to-GPU communication without the help of the CPU. |
| Experiment Setup | Yes | For our tests on MNIST we used rectified linear units, a 784x1024x1024x10 architecture with dropout (0.2, 0.3, 0.3), a learning rate of 0.003, and RMSProp (Tieleman & Hinton, 2012). We used a convolutional network with two convolutional layers (64x5x5, 64x3x3) which were followed by max-pooling (3x3) and contrast normalization after each layer. These layers were followed by two locally connected convolutional layers (no weight sharing) and a final fully connected softmax layer. (A hedged sketch of the MNIST configuration follows the table.) |
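To make the compression idea quoted in the Research Type row concrete, here is a minimal sketch of an 8-bit gradient round trip. The paper's own data types (dynamic fixed point and a dynamic tree datatype) are more involved; the linear absmax quantization below, including the function names, is an illustrative assumption rather than the authors' implementation.

```python
# Illustrative sketch only: the paper uses custom 8-bit data types
# (dynamic fixed point, dynamic tree); this generic linear absmax
# quantization merely shows a 32-bit -> 8-bit -> 32-bit round trip.
import numpy as np

def compress_8bit(grad):
    """Map a float32 tensor onto int8 codes via symmetric absmax scaling."""
    scale = float(np.abs(grad).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero gradient: avoid division by zero
    codes = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return codes, scale

def decompress_8bit(codes, scale):
    """Recover an approximate float32 tensor from int8 codes."""
    return codes.astype(np.float32) * scale

# Quantize a synthetic gradient and inspect the approximation error.
rng = np.random.default_rng(0)
g = rng.normal(0.0, 0.01, size=1024).astype(np.float32)
codes, scale = compress_8bit(g)
g_hat = decompress_8bit(codes, scale)
print("max abs error:", float(np.abs(g - g_hat).max()))
```

Transmitting the int8 codes plus one scale per tensor cuts the bytes on the wire by roughly 4x; the paper's reported 2x data transfer speedup also accounts for compression and decompression overhead.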
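The MNIST configuration quoted in the Experiment Setup row can likewise be written down directly. The sketch below uses PyTorch purely for illustration (the paper predates it and used the author's clusterNet CUDA library); only the layer sizes, dropout rates, optimizer, and learning rate come from the paper, while the batch size and loss function are assumptions.

```python
# Hedged PyTorch sketch of the quoted MNIST setup: a 784x1024x1024x10
# ReLU network with dropout (0.2, 0.3, 0.3) trained with RMSProp at
# lr=0.003. The framework, batch size, and loss function are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(0.2),       # dropout on the input layer
    nn.Linear(784, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),       # dropout on the first hidden layer
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),       # dropout on the second hidden layer
    nn.Linear(1024, 10),   # softmax is folded into the loss below
)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```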