Critical initialisation in continuous approximations of binary neural networks
Authors: George Stamatescu, Federica Gerace, Carlo Lucibello, Ian Fuss, Langford White
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We predict theoretically, and confirm numerically, that common weight initialisation schemes used in standard continuous networks, when applied to the mean values of the stochastic binary weights, yield poor training performance. This study shows that, contrary to common intuition, the means of the stochastic binary weights should be initialised close to 1 for deeper networks to be trainable. The results of the theoretical study, which are supported by numerical simulations and experiment, establish that for a surrogate of arbitrary depth to be trainable, it must be randomly initialised at criticality. (Section 4, Numerical and Experimental Results) |
| Researcher Affiliation | Academia | George Stamatescu, Ian Fuss and Langford B. White, School of Electrical and Electronic Engineering, University of Adelaide, Adelaide, Australia ({george.stamatescu}@gmail.com, {lang.white,ian.fuss}@adelaide.edu.au); Federica Gerace, Institut de Physique Théorique, CNRS & CEA & Université Paris-Saclay, Saclay, France (federicagerace91@gmail.com); Carlo Lucibello, Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy (carlo.lucibello@unibocconi.it) |
| Pseudocode | No | The paper describes mathematical derivations and theoretical frameworks but does not include any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any statements about releasing source code for the methodology or provide links to a code repository. |
| Open Datasets | Yes | We use the MNIST dataset with reduced training set size (50%) and record the training performance (percentage of the training set correctly labeled) after 10 epochs of gradient descent over the training set, for various network depths L < 70 and different mean variances σ_m^2 ∈ [0, 1). |
| Dataset Splits | No | The paper mentions using a 'reduced training set size (50%)' of the MNIST dataset, but it does not specify a validation set or describe how the data was split into training, validation, or test sets for model evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using 'SGD with Adam (Kingma & Ba, 2014)' as the optimizer, but it does not provide version numbers for Adam or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | We use the MNIST dataset with reduced training set size (50%) and record the training performance (percentage of the training set correctly labeled) after 10 epochs of gradient descent over the training set, for various network depths L < 70 and different mean variances σ_m^2 ∈ [0, 1). The optimizer used was SGD with Adam (Kingma & Ba, 2014) with a learning rate of 2 × 10^-4, chosen after a simple grid search, and a batch size of 64. A minimal code sketch of this setup follows the table. |
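
The sketch below makes the reported setup concrete. It is a minimal PyTorch sketch, not the authors' code: it assumes a plain fully connected surrogate in which the stochastic binary weights are replaced by their means, tanh activations, a hidden width of 256, and a mean initialisation m = ±sqrt(σ_m^2) so that the variance of the means equals σ_m^2. The details taken from the table above are the 50% MNIST training subset, 10 epochs, Adam with learning rate 2 × 10^-4, batch size 64, and the scan over depth and mean variance σ_m^2 ∈ [0, 1); everything else (architecture, width, exact initialisation law) is an illustrative assumption.

```python
# Minimal sketch of the reported experiment; architecture and the exact
# initialisation law are assumptions, not taken from the paper.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms


def init_binary_means_(layer: nn.Linear, sigma_m2: float) -> None:
    """Set each weight mean to +/- sqrt(sigma_m2), so Var(m) = sigma_m2 (assumed scheme)."""
    with torch.no_grad():
        signs = torch.randint(0, 2, layer.weight.shape).float() * 2 - 1
        layer.weight.copy_(signs * sigma_m2 ** 0.5)
        layer.bias.zero_()


def make_surrogate(depth: int, width: int, sigma_m2: float) -> nn.Sequential:
    """Fully connected surrogate whose weights are the means of the binary weights."""
    layers, d_in = [], 28 * 28
    for _ in range(depth):
        lin = nn.Linear(d_in, width)
        init_binary_means_(lin, sigma_m2)
        layers += [lin, nn.Tanh()]
        d_in = width
    head = nn.Linear(d_in, 10)
    init_binary_means_(head, sigma_m2)
    layers.append(head)
    return nn.Sequential(nn.Flatten(), *layers)


def train_accuracy(depth: int, sigma_m2: float, epochs: int = 10) -> float:
    full = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    half = Subset(full, range(len(full) // 2))            # reduced training set (50%)
    loader = DataLoader(half, batch_size=64, shuffle=True)
    model = make_surrogate(depth, width=256, sigma_m2=sigma_m2)
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)   # learning rate from the paper's grid search
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():                                  # training-set accuracy, as reported
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in loader)
    return correct / len(half)


if __name__ == "__main__":
    for sigma_m2 in (0.1, 0.5, 0.9):                       # scan over mean variances in [0, 1)
        print(sigma_m2, train_accuracy(depth=20, sigma_m2=sigma_m2, epochs=1))
```

If the paper's claim holds, runs with σ_m^2 close to 1 should remain trainable at larger depths, while small-variance initialisations of the means should degrade as depth grows.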