An Empirical Study of Binary Neural Networks' Optimisation

Authors: Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane, Yarin Gal

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we empirically identify and study the effectiveness of the various ad-hoc techniques commonly used in the literature, providing best-practices for efficient training of binary models. We show that adapting learning rates using second-moment methods is crucial for the successful use of the STE, and that other optimisers can easily get stuck in local minima. We also find that many of the commonly employed tricks are only effective towards the end of the training, with these methods making early stages of the training considerably slower. Our analysis disambiguates necessary from unnecessary ad-hoc techniques for the training of binary neural networks, paving the way for future development of solid theoretical foundations for these."
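The abstract's central finding concerns the straight-through estimator (STE), which lets gradients pass through the non-differentiable sign() used to binarise weights. A minimal NumPy sketch of the idea (the function names, the clipping threshold of 1.0, and the ≥ 0 sign convention are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def binarize_forward(w):
    # Forward pass: deterministic binarisation of latent
    # real-valued weights to {-1, +1}.
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(grad_out, w, clip=1.0):
    # Straight-through estimator: pass the incoming gradient through
    # sign() unchanged, but zero it where the latent weight has
    # saturated beyond the clipping threshold (a common STE variant).
    return grad_out * (np.abs(w) <= clip)

w = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
b = binarize_forward(w)                 # [-1., -1., 1., 1., 1.]
g = ste_backward(np.ones_like(w), w)    # [0., 1., 1., 1., 0.]
```

The paper's observation is that updates computed through this estimator work well only when paired with an optimiser that adapts learning rates via second-moment estimates (e.g. ADAM).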
Researcher Affiliation | Academia | Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane & Yarin Gal, Department of Computer Science, University of Oxford
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We provide our reference implementations and training-evaluation source code online": https://github.com/mi-lad/studying-binary-neural-networks
Open Datasets | Yes | A CNN inspired by VGG-10 (Simonyan & Zisserman, 2015) on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, and an MLP with three hidden layers of 2048 units and rectified linear units (ReLUs) for the MNIST (LeCun, 1998) dataset.
Dataset Splits | Yes | "We use the last 10% of the training set for validation and report the best accuracy on the test set associated with the highest validation accuracy achieved during training."
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | "We ran experiments for more epochs than typically required for the datasets (up to 500 epochs depending on the experiment). In each experiment, the relevant hyper-parameters were tuned for best results. We make use of gradient and weight clipping and squared hinge loss unless stated otherwise." The paper further reports three setup details: (1) among the ADAM hyper-parameters (see Figure 2c), the decay rate of the second-moment estimate plays a significant role; its default is usually large (e.g. 0.99), while some binary models use smaller values. (2) Applying max pooling to a binary vector yields a vector of almost all ones, so in both block-reordering variants studied (see Figure 3) pooling is performed immediately after the convolution operator, where activations are not yet binary; omitting this change caused significant accuracy loss. (3) A binary model can be trained in two stages: first with vanilla STE at higher learning rates, then, once accuracy stops improving, re-enabling clipping and reducing the learning rate.
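The pooling-placement point can be made concrete with a small NumPy sketch: binarising before max pooling makes every positive unit tie at +1, so the pooled gradient is routed to an arbitrary (first) tied unit instead of the genuinely strongest one. The variable names and window values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sign(x):
    # Deterministic binarisation to {-1, +1}.
    return np.where(x >= 0, 1.0, -1.0)

# Hypothetical real-valued pre-activations for one pooling window,
# as they would come out of a convolution.
pre_act = np.array([0.2, 3.0, -1.0, 0.5])

# Pooling AFTER binarisation: every positive unit ties at +1, so the
# argmax (the unit that would receive the pooled gradient) falls back
# to the first tied entry.
idx_binary = int(np.argmax(sign(pre_act)))   # 0

# Pooling BEFORE binarisation (the ordering the paper recommends):
# the maximum is unique, so the gradient reaches the strongest unit.
idx_real = int(np.argmax(pre_act))           # 1
```

This is consistent with the paper's observation that keeping the original ordering (pooling after binarisation) resulted in significant accuracy loss.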