An Empirical Study of Binary Neural Networks' Optimisation

Authors: Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane, Yarin Gal

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we empirically identify and study the effectiveness of the various ad-hoc techniques commonly used in the literature, providing best-practices for efficient training of binary models. We show that adapting learning rates using second-moment methods is crucial for the successful use of the STE, and that other optimisers can easily get stuck in local minima. We also find that many of the commonly employed tricks are only effective towards the end of the training, with these methods making early stages of the training considerably slower. Our analysis disambiguates necessary from unnecessary ad-hoc techniques for the training of binary neural networks, paving the way for future development of solid theoretical foundations for these."
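The abstract's central finding concerns the straight-through estimator (STE), which lets gradients pass through the non-differentiable sign() used to binarise weights. A minimal NumPy sketch of the idea (the function names, the clipping threshold of 1.0, and the ≥ 0 sign convention are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def binarize_forward(w):
    # Forward pass: deterministic binarisation of latent
    # real-valued weights to {-1, +1}.
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(grad_out, w, clip=1.0):
    # Straight-through estimator: pass the incoming gradient through
    # sign() unchanged, but zero it where the latent weight has
    # saturated beyond the clipping threshold (a common STE variant).
    return grad_out * (np.abs(w) <= clip)

w = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
b = binarize_forward(w)                 # [-1., -1., 1., 1., 1.]
g = ste_backward(np.ones_like(w), w)    # [0., 1., 1., 1., 0.]
```

The paper's observation is that updates computed through this estimator work well only when paired with an optimiser that adapts learning rates via second-moment estimates (e.g. ADAM).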
Researcher Affiliation | Academia | Milad Alizadeh, Javier Fernández-Marqués, Nicholas D. Lane & Yarin Gal, Department of Computer Science, University of Oxford
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We provide our reference implementations and training-evaluation source code online": https://github.com/mi-lad/studying-binary-neural-networks
Open Datasets | Yes | A CNN inspired by VGG-10 (Simonyan & Zisserman, 2015) on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, and an MLP with three hidden layers of 2048 units and rectified linear units (ReLUs) for the MNIST (LeCun, 1998) dataset.
Dataset Splits | Yes | "We use the last 10% of the training set for validation and report the best accuracy on the test set associated with the highest validation accuracy achieved during training."
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | "We ran experiments for more epochs than typically required for the datasets (up to 500 epochs depending on the experiment). In each experiment, the relevant hyper-parameters were tuned for best results. We make use of gradient and weight clipping and squared hinge loss unless stated otherwise." The paper further reports three setup details: (1) among the ADAM hyper-parameters (see Figure 2c), the decay rate of the second-moment estimate plays a significant role; its default is usually large (e.g. 0.99), while some binary models use smaller values. (2) Applying max pooling to a binary vector yields a vector of almost all ones, so in both block-reordering variants studied (see Figure 3) pooling is performed immediately after the convolution operator, where activations are not yet binary; omitting this change caused significant accuracy loss. (3) A binary model can be trained in two stages: first with vanilla STE at higher learning rates, then, once accuracy stops improving, re-enabling clipping and reducing the learning rate.
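The pooling-placement point can be made concrete with a small NumPy sketch: binarising before max pooling makes every positive unit tie at +1, so the pooled gradient is routed to an arbitrary (first) tied unit instead of the genuinely strongest one. The variable names and window values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sign(x):
    # Deterministic binarisation to {-1, +1}.
    return np.where(x >= 0, 1.0, -1.0)

# Hypothetical real-valued pre-activations for one pooling window,
# as they would come out of a convolution.
pre_act = np.array([0.2, 3.0, -1.0, 0.5])

# Pooling AFTER binarisation: every positive unit ties at +1, so the
# argmax (the unit that would receive the pooled gradient) falls back
# to the first tied entry.
idx_binary = int(np.argmax(sign(pre_act)))   # 0

# Pooling BEFORE binarisation (the ordering the paper recommends):
# the maximum is unique, so the gradient reaches the strongest unit.
idx_real = int(np.argmax(pre_act))           # 1
```

This is consistent with the paper's observation that keeping the original ordering (pooling after binarisation) resulted in significant accuracy loss.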