8-Bit Approximations for Parallelism in Deep Learning

Authors: Tim Dettmers

ICLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations. We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. We build a predictive model for speedups based on our experimental data, verify its validity on known speedup data, and show that we can obtain a speedup of 50x and more on a system of 96 GPUs compared to a speedup of 23x for 32-bit. (An illustrative compression sketch appears after the table.)
Researcher Affiliation | Academia | Tim Dettmers, Faculty of Informatics, Università della Svizzera italiana, Via Giuseppe Buffi 13, CH-6904 Lugano, Switzerland, tim.dettmers@gmail.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Implementations of our 8-bit approximation algorithms are available online at https://github.com/TimDettmers/clusterNet/ ("contact me if you need help with integrating the functions into your library").
Open Datasets | Yes | We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism.
Dataset Splits | No | The paper mentions using well-known datasets (MNIST, CIFAR10, ImageNet) for testing, but it does not explicitly provide the training/validation/test split percentages or sample counts.
Hardware Specification | Yes | On average, these algorithms perform compression and decompression in 1 and 0.5 nanoseconds per number, respectively, as measured on an NVIDIA GTX Titan. The algorithms were run on two NVIDIA GTX Titans; a different GPU (GTX Titan X) was also used.
Software Dependencies | Yes | The standard size of gradients in deep learning is currently 32-bit, which is the smallest practical dimension for floating-point numbers on GPUs, as CUDA only supports 32- and 64-bit floating-point arithmetic. We used the message passing interface (MPI) implementation provided by Open MPI 1.8.5, which uses low-level CUDA routines to enable GPU-to-GPU communication without the help of the CPU. (See the data-parallel exchange sketch after the table.)
Experiment Setup | Yes | For our tests on MNIST we used rectified linear units, a 784x1024x1024x10 architecture with dropout (0.2, 0.3, 0.3), a learning rate of 0.003, and RMSProp (Tieleman & Hinton, 2012). We used a convolutional network with two convolutional layers (64x5x5, 64x3x3), each followed by max-pooling (3x3) and contrast normalization. These layers were followed by two locally connected convolutional layers (no weight sharing) and a final fully connected softmax layer. (See the MLP sketch after the table.)
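
The compression step quoted in the Research Type row can be illustrated with a simple quantizer. The sketch below uses linear max-abs scaling to 8-bit integers purely for illustration; the paper's clusterNet kernels implement their own 8-bit data types, and the function names here are hypothetical.

```python
import numpy as np

def compress_8bit(grad):
    """Quantize a float32 gradient tensor to int8 plus a single float scale.

    Simple max-abs linear quantization, used only for illustration; the
    paper's own 8-bit data types and CUDA kernels differ.
    """
    scale = float(np.abs(grad).max())
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(grad / scale * 127.0), -127, 127).astype(np.int8)
    return q, scale

def decompress_8bit(q, scale):
    """Recover an approximate float32 gradient from the 8-bit representation."""
    return q.astype(np.float32) * (scale / 127.0)

# Example: the 8-bit payload is ~4x smaller per transfer at a small error cost.
grad = (np.random.randn(1024, 1024) * 0.01).astype(np.float32)
q, scale = compress_8bit(grad)
approx = decompress_8bit(q, scale)
print("max abs error:", np.abs(grad - approx).max())
```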
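
For the MPI-based GPU-to-GPU communication noted under Software Dependencies, a hedged sketch of how compressed gradients might be exchanged in data parallelism follows. It assumes mpi4py and host-memory buffers; the paper's implementation instead calls low-level CUDA routines from Open MPI 1.8.5 inside clusterNet, and all names here are hypothetical.

```python
# Hypothetical data-parallel gradient exchange with 8-bit payloads.
# Assumes mpi4py and an MPI runtime such as Open MPI, e.g.:
#   mpirun -np 4 python exchange_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Stand-in local gradient for this worker (a real run would use model gradients).
local_grad = (np.random.randn(4096) * 0.01).astype(np.float32)

# Quantize to 8 bits (same max-abs scheme as the previous sketch).
scale = max(float(np.abs(local_grad).max()), 1e-8)
q = np.clip(np.round(local_grad / scale * 127.0), -127, 127).astype(np.int8)

# Gather every worker's 8-bit buffer and scale; only 8-bit data goes over the wire.
all_q = np.empty((size, q.size), dtype=np.int8)
all_scales = np.empty(size, dtype=np.float32)
comm.Allgather(q, all_q)
comm.Allgather(np.array([scale], dtype=np.float32), all_scales)

# Decompress each contribution and average, as in synchronous data parallelism.
avg_grad = np.mean(
    [all_q[i].astype(np.float32) * (all_scales[i] / 127.0) for i in range(size)],
    axis=0,
)
```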
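
The MNIST setup in the Experiment Setup row can be approximated in a few lines of PyTorch. This is only a sketch under stated assumptions: the paper used clusterNet rather than PyTorch, the dropout rates (0.2, 0.3, 0.3) are interpreted as input and hidden-layer dropout, and the batch size is a placeholder not given in the quoted text.

```python
# Minimal re-creation of the described MNIST MLP: 784x1024x1024x10 with
# rectified linear units, dropout (0.2, 0.3, 0.3), RMSProp, learning rate 0.003.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(0.2),                 # dropout on the 784-dim input
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(1024, 10),             # softmax is folded into the loss below
)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.003)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random data shaped like MNIST batches.
x = torch.randn(128, 784)            # placeholder batch; a real run would load MNIST
y = torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```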