Neural Networks with Few Multiplications
Authors: Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, Yoshua Bengio
ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across 3 popular datasets (MNIST, CIFAR10, SVHN) show that this approach not only does not hurt classification performance but can result in even better performance than standard stochastic gradient descent training, paving the way to fast, hardware-friendly training of neural networks. |
| Researcher Affiliation | Academia | Zhouhan Lin, Université de Montréal, Canada, zhouhan.lin@umontreal.ca; Matthieu Courbariaux, Université de Montréal, Canada, matthieu.courbariaux@gmail.com; Roland Memisevic, Université de Montréal, Canada, roland.umontreal@gmail.com; Yoshua Bengio, Université de Montréal, Canada |
| Pseudocode | Yes | Algorithm 1 Quantized Back Propagation (QBP). |
| Open Source Code | Yes | The codes for these approaches are available online at https://github.com/hantek/BinaryConnect |
| Open Datasets | Yes | We experimented with 3 datasets: MNIST, CIFAR10, and SVHN. |
| Dataset Splits | Yes | The training set is separated into two parts, one of which is the training set with 40000 images and the other the validation set with 10000 images. |
| Hardware Specification | No | The paper mentions training on "GPU or CPU clusters" but does not provide specific hardware models (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper states "Our implementation uses Theano (Bastien et al., 2012)", which names the software but does not provide a specific version number for Theano. |
| Experiment Setup | Yes | All models are trained with stochastic gradient descent (SGD) without momentum. We use batch normalization for all the models to accelerate learning. At training time, binary (ternary) connect and quantized back propagation are used, while at test time, we use the learned full resolution weights for the forward propagation. For each dataset, all hyper-parameters are set to the same values for the different methods, except that the learning rate is adapted independently for each one. The MNIST model uses a fully connected network with 4 layers: 784-1024-1024-1024-10. The training set is separated into two parts, one of which is the training set with 40000 images and the other the validation set with 10000 images. Training is conducted in a mini-batch way, with a batch size of 200. |
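
The "Pseudocode" row above refers to Algorithm 1, Quantized Back Propagation (QBP). As a rough illustration of its weight-sampling step, the following is a minimal NumPy sketch of BinaryConnect-style stochastic binarization, assuming the hard-sigmoid probability p = clip((w + 1)/2, 0, 1). The names `hard_sigmoid` and `stochastic_binarize` are placeholders, and the full algorithm's quantization of back-propagated errors to powers of two is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_sigmoid(w):
    """Clip (w + 1) / 2 into [0, 1] to get the probability of sampling +1."""
    return np.clip((w + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(w):
    """Sample binary weights in {-1, +1}; +1 with probability hard_sigmoid(w)."""
    return np.where(rng.random(w.shape) < hard_sigmoid(w), 1.0, -1.0)

# The sampled binary weights are used in the forward pass during training;
# the real-valued weights are the ones updated by SGD and kept for test time.
W_real = 0.01 * rng.standard_normal((784, 1024))
W_binary = stochastic_binarize(W_real)
```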
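Likewise, the "Experiment Setup" row describes the MNIST model as a 784-1024-1024-1024-10 fully connected network trained with SGD without momentum, batch normalization, and mini-batches of 200. The sketch below re-creates that configuration in PyTorch purely for illustration (the authors' implementation uses Theano); the ReLU activations and the learning rate are assumptions not given in the quoted text, and during training the binary/ternary sampled weights shown above would stand in for the full-resolution weights in the forward pass.

```python
import torch
import torch.nn as nn

# Hypothetical re-creation of the quoted MNIST setup: layer sizes
# 784-1024-1024-1024-10, batch normalization, SGD without momentum.
model = nn.Sequential(
    nn.Linear(784, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
    nn.Linear(1024, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.0)  # lr is a placeholder
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One mini-batch update (batch size 200 in the paper's MNIST setup)."""
    optimizer.zero_grad()
    loss = criterion(model(x.view(-1, 784)), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```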