8-Bit Approximations for Parallelism in Deep Learning
Authors: Tim Dettmers
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop and test 8-bit approximation algorithms which make better use of the available bandwidth by compressing 32-bit gradients and nonlinear activations to 8-bit approximations. We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. We build a predictive model for speedups based on our experimental data, verify its validity on known speedup data, and show that we can obtain a speedup of 50x and more on a system of 96 GPUs compared to a speedup of 23x for 32-bit. (An illustrative quantization sketch appears below the table.) |
| Researcher Affiliation | Academia | Tim Dettmers, Faculty of Informatics, Università della Svizzera italiana, Via Giuseppe Buffi 13, CH-6904 Lugano, Switzerland, tim.dettmers@gmail.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementations of our 8-bit approximation algorithms are available online at https://github.com/TimDettmers/clusterNet/; contact me if you need help with integrating the functions into your library. |
| Open Datasets | Yes | We show that these approximations do not decrease predictive performance on MNIST, CIFAR10, and ImageNet for both model and data parallelism and provide a data transfer speedup of 2x relative to 32-bit parallelism. |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST, CIFAR10, and ImageNet for testing, but it does not explicitly provide the specific percentages or sample counts for training, validation, and test splits within the text. |
| Hardware Specification | Yes | On average, these algorithms perform compression and decompression in 1 and 0.5 nanoseconds per number, respectively, as measured on an NVIDIA GTX Titan. The algorithms were run on two NVIDIA GTX Titans. We also used a different GPU (GTX Titan X). |
| Software Dependencies | Yes | The standard size of gradients in deep learning is currently 32-bit, which is the smallest practical dimension for floating point numbers on GPUs, as CUDA only supports 32 and 64-bit floating point arithmetic. We used the message passing interface (MPI) implementation provided by Open MPI 1.8.5, which uses low level CUDA routines to enable GPU-to-GPU communication without the help of the CPU. |
| Experiment Setup | Yes | For our tests on MNIST we used rectified linear units, a 784x1024x1024x10 architecture with dropout (0.2, 0.3, 0.3), a learning rate of 0.003, and RMSProp (Tieleman & Hinton, 2012). We used a convolutional network with two convolutional layers (64x5x5, 64x3x3) which were followed by max-pooling (3x3) and contrast normalization after each layer. These layers were followed by two locally connected convolutional layers (no weight sharing) and a final fully connected softmax layer. (A hedged sketch of the MNIST configuration follows the table.) |
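To make the compression idea quoted in the Research Type row concrete, here is a minimal sketch of an 8-bit gradient round trip. The paper's own data types (dynamic fixed point and a dynamic tree datatype) are more involved; the linear absmax quantization below, including the function names, is an illustrative assumption rather than the authors' implementation.

```python
# Illustrative sketch only: the paper uses custom 8-bit data types
# (dynamic fixed point, dynamic tree); this generic linear absmax
# quantization merely shows a 32-bit -> 8-bit -> 32-bit round trip.
import numpy as np

def compress_8bit(grad):
    """Map a float32 tensor onto int8 codes via symmetric absmax scaling."""
    scale = float(np.abs(grad).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero gradient: avoid division by zero
    codes = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return codes, scale

def decompress_8bit(codes, scale):
    """Recover an approximate float32 tensor from int8 codes."""
    return codes.astype(np.float32) * scale

# Quantize a synthetic gradient and inspect the approximation error.
rng = np.random.default_rng(0)
g = rng.normal(0.0, 0.01, size=1024).astype(np.float32)
codes, scale = compress_8bit(g)
g_hat = decompress_8bit(codes, scale)
print("max abs error:", float(np.abs(g - g_hat).max()))
```

Transmitting the int8 codes plus one scale per tensor cuts the bytes on the wire by roughly 4x; the paper's reported 2x data transfer speedup also accounts for compression and decompression overhead.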
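The MNIST configuration quoted in the Experiment Setup row can likewise be written down directly. The sketch below uses PyTorch purely for illustration (the paper predates it and used the author's clusterNet CUDA library); only the layer sizes, dropout rates, optimizer, and learning rate come from the paper, while the batch size and loss function are assumptions.

```python
# Hedged PyTorch sketch of the quoted MNIST setup: a 784x1024x1024x10
# ReLU network with dropout (0.2, 0.3, 0.3) trained with RMSProp at
# lr=0.003. The framework, batch size, and loss function are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(0.2),       # dropout on the input layer
    nn.Linear(784, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),       # dropout on the first hidden layer
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),       # dropout on the second hidden layer
    nn.Linear(1024, 10),   # softmax is folded into the loss below
)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.003)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in data.
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```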