Training and Inference with Integers in Deep Neural Networks

Authors: Shuang Wu, Guoqi Li, Feng Chen, Luping Shi

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed framework is evaluated on the MNIST, CIFAR10, SVHN, and ImageNet datasets. Compared with methods that only discretize weights and activations at inference time, it has comparable accuracy and can further alleviate overfitting, indicating some type of regularization. WAGE produces a pure bidirectional low-precision integer dataflow for DNNs, which can be applied neatly for training and inference in dedicated hardware. (A minimal quantizer sketch follows the table.)
Researcher Affiliation | Academia | Shuang Wu (1), Guoqi Li (1), Feng Chen (2), Luping Shi (1); (1) Department of Precision Instrument, (2) Department of Automation; Center for Brain Inspired Computing Research, Beijing Innovation Center for Future Chip, Tsinghua University; {lpshi,chenfeng}@mail.tsinghua.edu.cn
Pseudocode | Yes | Appendix A (Algorithm): We assume that network structures are defined and initialized with Equation 5. The annotations after the pseudocode are potential corresponding operations for implementation in a fixed-point dataflow. Algorithm 1: Training an I-layer net with the WAGE method on a floating-point-based or integer-based device; weights, activations, gradients and errors are quantized according to Equations 6–12. (A training-step skeleton follows the table.)
Open Source Code | Yes | We publish the code on GitHub: https://github.com/boluoweifenda/WAGE
Open Datasets | Yes | The proposed framework is evaluated on the MNIST, CIFAR10, SVHN, and ImageNet datasets. Our method is evaluated on MNIST, SVHN, CIFAR10 and ILSVRC12 (Russakovsky et al., 2015) and Table 1 shows the comparison results.
Dataset Splits | Yes | For the CIFAR10 dataset, we follow the data augmentation in Lee et al. (2015) for training: 4 pixels are padded on each side, and a 32 × 32 patch is randomly cropped from the padded image or its horizontal flip. For testing, only a single view of the original 32 × 32 image is evaluated. ... For testing, the single center crop of the validation set is evaluated. (An augmentation sketch follows the table.)
Hardware Specification | No | The paper makes a generic reference to "floating-point hardware like GPU" but does not specify any particular GPU models, CPU models, or other hardware components used for running the experiments. It lacks specific details such as model numbers, memory, or processor types.
Software Dependencies | No | We first build the computation graph for a vanilla network, then insert quantization nodes in forward propagation and override gradients in backward propagation for each layer on TensorFlow (Abadi et al., 2016). The paper mentions TensorFlow but does not provide a specific version number or any other software dependencies with their versions. (A gradient-override sketch follows the table.)
Experiment Setup | Yes | In this section, we set W-A-G-E bits to 2-8-8-8 as default for all layers in a CNN or MLP. The learning rate η in WAGE remains 1 for the whole 100 epochs. ... The model is trained with a mini-batch size of 128 for 300 epochs in total. Learning rate η is set to 8 and divided by 8 at epoch 200 and epoch 250. (The schedule is sketched after the table.)
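
The rows above refer repeatedly to WAGE's quantization of weights, activations, gradients, and errors. The following is a minimal sketch of a k-bit linear quantizer in the spirit of the paper's Q(x, k); the step size σ(k) = 2^(1-k) and the clipping range are assumptions based on the paper's description, not a verbatim copy of Equations 6–12.

```python
import numpy as np

def quantize(x, k):
    """Sketch of a k-bit linear quantizer in the spirit of WAGE's Q(x, k).

    Assumption: values snap to a uniform grid with step sigma = 2**(1 - k)
    and are clipped to (-1, 1); the paper's exact Equations 6-12 add
    per-layer scaling and shift factors that are omitted here.
    """
    sigma = 2.0 ** (1 - k)
    y = sigma * np.round(x / sigma)
    return np.clip(y, -1.0 + sigma, 1.0 - sigma)

# Example: 2-bit weights (the "W" in the default 2-8-8-8 setting)
w = np.array([-0.7, -0.2, 0.1, 0.6])
print(quantize(w, 2))   # values snap to the {-0.5, 0.0, 0.5} grid
```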
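
The Pseudocode row summarizes Algorithm 1, which quantizes weights, activations, errors, and gradients at every layer during training. The toy step below is a hedged sketch of that dataflow on a single linear layer: it reuses the quantizer above, substitutes plain SGD for the paper's shift-based update, and omits the per-layer scale factors; none of the variable names come from the WAGE repository.

```python
import numpy as np

def quantize(x, k):                       # same k-bit helper as in the sketch above
    s = 2.0 ** (1 - k)
    return np.clip(s * np.round(x / s), -1.0 + s, 1.0 - s)

# Toy single linear layer trained with a quantized W/A/G/E dataflow.
# Illustrative only: not Algorithm 1 verbatim.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 2))
x = quantize(0.5 * rng.standard_normal((8, 4)), 8)   # 8-bit inputs
t = quantize(0.5 * rng.standard_normal((8, 2)), 8)   # 8-bit targets

for step in range(100):
    Wq = quantize(W, 2)                  # W: 2-bit weights used in the forward pass
    a = quantize(x @ Wq, 8)              # A: 8-bit activations
    e = quantize(a - t, 8)               # E: 8-bit backpropagated error
    g = quantize(x.T @ e / len(x), 8)    # G: 8-bit weight gradient
    W -= 0.1 * g                         # higher-precision weight buffer, SGD stand-in
```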
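
The Dataset Splits row quotes the CIFAR10 augmentation: pad 4 pixels per side, take a random 32 × 32 crop (or its horizontal flip) for training, and evaluate only the original 32 × 32 view at test time. Below is a minimal NumPy sketch of that procedure, not code from the WAGE repository.

```python
import numpy as np

def augment_cifar10(image, rng):
    """Training-time augmentation as described (Lee et al., 2015 style):
    pad 4 pixels per side, random 32x32 crop, random horizontal flip."""
    padded = np.pad(image, ((4, 4), (4, 4), (0, 0)), mode="constant")  # 40x40x3
    top = rng.integers(0, 9)    # 40 - 32 + 1 = 9 possible offsets
    left = rng.integers(0, 9)
    crop = padded[top:top + 32, left:left + 32, :]
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]  # horizontal flip
    return crop

# Test time: only the single original 32x32 view is evaluated (no augmentation).
rng = np.random.default_rng(0)
img = np.zeros((32, 32, 3), dtype=np.float32)
aug = augment_cifar10(img, rng)   # shape (32, 32, 3)
```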
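
The Software Dependencies row notes that quantization nodes are inserted in forward propagation and gradients are overridden in backward propagation on TensorFlow. The sketch below shows one common way to express such an override with tf.custom_gradient (a straight-through pass for the rounding step); the 2018 codebase targeted TF 1.x, and WAGE's actual backward rule additionally re-quantizes the error itself, so this is illustrative only.

```python
import tensorflow as tf

@tf.custom_gradient
def quantize_ste(x):
    """Forward: snap x to an 8-bit grid in (-1, 1).
    Backward: pass the incoming gradient through unchanged (straight-through).
    Illustrates the 'insert quantization nodes / override gradients' idea;
    not the exact mechanism used in the WAGE repository."""
    sigma = 2.0 ** (1 - 8)
    y = tf.clip_by_value(sigma * tf.round(x / sigma), -1.0 + sigma, 1.0 - sigma)
    def grad(dy):
        return dy
    return y, grad

x = tf.Variable([[0.3, -0.7]])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(quantize_ste(x) ** 2)
print(tape.gradient(loss, x))   # nonzero thanks to the overridden gradient
```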
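
The Experiment Setup row gives the reported hyperparameters for the 300-epoch run: W-A-G-E bits 2-8-8-8, mini-batch size 128, and η = 8 divided by 8 at epochs 200 and 250. A small configuration sketch follows; the variable names are ours, not the repository's, and η is applied inside WAGE's update rule rather than as a conventional floating-point learning rate.

```python
# Hedged sketch of the reported CIFAR10-style schedule.
config = {
    "bits": {"W": 2, "A": 8, "G": 8, "E": 8},
    "batch_size": 128,
    "epochs": 300,
}

def learning_rate(epoch, base=8.0):
    """eta = 8, divided by 8 at epoch 200 and again at epoch 250."""
    if epoch >= 250:
        return base / 64.0
    if epoch >= 200:
        return base / 8.0
    return base

assert learning_rate(0) == 8.0 and learning_rate(200) == 1.0 and learning_rate(250) == 0.125
```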