Relaxed Quantization for Discretized Neural Networks

Authors: Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, Max Welling

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally validate the performance of our method on MNIST, CIFAR 10 and Imagenet classification.
Researcher Affiliation | Collaboration | Christos Louizos (University of Amsterdam, TNO Intelligent Imaging) c.louizos@uva.nl; Matthias Reisser (QUVA Lab, University of Amsterdam) m.reisser@uva.nl; Tijmen Blankevoort (Qualcomm AI Research) tijmen@qti.qualcomm.com; Efstratios Gavves (QUVA Lab, University of Amsterdam) egavves@uva.nl; Max Welling (University of Amsterdam, Qualcomm) m.welling@uva.nl
Pseudocode | Yes | Algorithm 1: Quantization during training. ... Algorithm 2: Quantization during testing.
Open Source Code | No | The paper mentions that experiments were implemented with TensorFlow and Keras, and refers to a TensorFlow GitHub repository for a pre-trained MobileNet model and for Jacob et al. (2017)'s code, but does not provide a link to the authors' own implementation of RQ.
Open Datasets | Yes | We experimentally validate the performance of our method on MNIST, CIFAR 10 and Imagenet classification.
Dataset Splits | Yes | The final models were determined through early stopping using the validation loss computed with minibatch statistics, in case the model uses batch normalization.
Hardware Specification | Yes | In terms of wall-clock time, training the RQ model with a full (4 elements) grid took approximately 15 times as long as the high-precision baseline with an implementation in Tensorflow v1.11.0 and running on a single Titan-X Nvidia GPU.
Software Dependencies | Yes | All experiments were implemented with TensorFlow (Abadi et al., 2015), using the Keras library (Chollet et al., 2015). ... running on a single Titan-X Nvidia GPU.
Experiment Setup | Yes | For the MNIST experiment we rescaled the input to the [-1, 1] range, employed no regularization and the network was trained with Adam (Kingma & Ba, 2014) and a batch size of 128. We used a local grid whenever the bit width was larger than 2 for both weights and biases (shared grid parameters), as well as for the outputs of the ReLU, with δ = 3. For the 8 and 4 bit networks we used a temperature λ of 2, whereas for the 2 bit models we used a temperature of 1 for RQ. We trained the 8 and 4 bit networks for 100 epochs using a learning rate of 1e-3 and the 2 bit networks for 200 epochs with a learning rate of 5e-4. In all cases the learning rate was annealed to zero during the last 50 epochs.
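The Pseudocode row above refers to the paper's Algorithm 1 (quantization during training) and Algorithm 2 (quantization during testing). The snippet below is only a rough TensorFlow 2 sketch of that train/test split under stated assumptions: a logistic noise model over an evenly spaced grid, a Gumbel-softmax (concrete) relaxation at temperature λ during training, and deterministic rounding to the nearest grid point at test time. The function names, the boundary handling, and the exact noise model are assumptions, not the authors' released code.

```python
import tensorflow as tf

def rq_quantize_train(x, grid, sigma, lam):
    # Sketch of "quantization during training": soften the categorical choice
    # of a grid point with a Gumbel-softmax sample at temperature lam.
    x = tf.expand_dims(x, -1)                    # shape [..., 1] against grid [K]
    half = (grid[1] - grid[0]) / 2.0             # half the grid spacing
    # Probability mass of the logistic-noise-perturbed value landing in each
    # cell (interior cells only; boundary cells are not treated specially here).
    probs = (tf.sigmoid((grid + half - x) / sigma)
             - tf.sigmoid((grid - half - x) / sigma))
    logits = tf.math.log(probs + 1e-10)
    u = tf.random.uniform(tf.shape(logits), minval=1e-10, maxval=1.0)
    gumbel = -tf.math.log(-tf.math.log(u))       # standard Gumbel noise
    z = tf.nn.softmax((logits + gumbel) / lam, axis=-1)
    return tf.reduce_sum(z * grid, axis=-1)      # soft assignment to grid points

def rq_quantize_test(x, grid):
    # Sketch of "quantization during testing": deterministic rounding to the
    # nearest grid point.
    idx = tf.argmin(tf.abs(tf.expand_dims(x, -1) - grid), axis=-1)
    return tf.gather(grid, idx)
```

As a usage illustration, `rq_quantize_train(w, tf.linspace(-1.0, 1.0, 16), sigma=0.05, lam=2.0)` would relax a 4-bit quantization of a weight tensor `w`; the local-grid restriction (δ = 3) mentioned in the Experiment Setup row is omitted here for brevity.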
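The MNIST recipe in the Experiment Setup row maps fairly directly onto a Keras training script. The sketch below assumes TensorFlow 2, uses a plain fully connected stand-in network (the paper's quantized architecture and the RQ quantizers themselves are not reproduced), and treats the unstated annealing shape as linear; the 8/4-bit values for epochs and learning rate are shown, with the 2-bit values noted in comments.

```python
import tensorflow as tf

EPOCHS = 100        # 8- and 4-bit setting; the 2-bit models use 200 epochs
BASE_LR = 1e-3      # 8- and 4-bit setting; the 2-bit models use 5e-4
ANNEAL = 50         # learning rate annealed to zero over the last 50 epochs

def lr_schedule(epoch, lr):
    # Linear decay to zero during the final ANNEAL epochs
    # (the decay shape is an assumption; the paper only says "annealed to zero").
    start = EPOCHS - ANNEAL
    if epoch < start:
        return BASE_LR
    return BASE_LR * (EPOCHS - epoch) / ANNEAL

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 127.5 - 1.0   # rescale inputs to [-1, 1]

# Stand-in classifier with no regularization, as described; the RQ
# weight/activation quantizers are not included in this sketch.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(BASE_LR),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=EPOCHS,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```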