How to Start Training: The Effect of Initialization and Architecture

Authors: Boris Hanin, David Rolnick

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained. In Figure 1, we compare the effects of different initializations in networks with varying depth, where the width is equal to the depth (this is done to prevent FM2, see 3.3). Figure 1(a) shows that, as predicted, initializations for which the variance of weights is smaller than the critical value of 2/fan-in lead to a dramatic decrease in output length, while variance larger than this value causes the output length to explode. Figure 1(b) compares the ability of differently initialized networks to start training; it shows the average number of epochs required to achieve 20% test accuracy on MNIST [19].
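The quoted 2/fan-in criterion is concrete enough to probe numerically. Below is a minimal PyTorch sketch, not the authors' code, that assumes plain fully connected ReLU networks with no biases, width equal to depth, and i.i.d. Gaussian weights of variance sigma^2/fan-in; it estimates the mean squared output length for random unit inputs at variance multipliers below, at, and above the critical value sigma^2 = 2. The function names and the choice of 100 sample networks are illustrative only (the quoted figure uses 1,000).

```python
import torch
import torch.nn as nn

def make_relu_net(depth, width, var_scale):
    """ReLU net whose weights are i.i.d. N(0, var_scale / fan_in); var_scale = 2 is critical."""
    layers = []
    for _ in range(depth):
        lin = nn.Linear(width, width, bias=False)
        nn.init.normal_(lin.weight, mean=0.0, std=(var_scale / width) ** 0.5)
        layers += [lin, nn.ReLU()]
    return nn.Sequential(*layers)

@torch.no_grad()
def mean_squared_output_length(depth, var_scale, n_nets=100):
    width = depth  # width equal to depth, as in the quoted Figure 1 setup
    total = 0.0
    for _ in range(n_nets):
        net = make_relu_net(depth, width, var_scale)
        x = torch.randn(width)
        x = x / x.norm()  # random unit-length input
        total += net(x).pow(2).sum().item()
    return total / n_nets

# Variance below the critical value shrinks outputs; above it, they blow up with depth.
for var_scale in (1.5, 2.0, 2.5):
    print(var_scale, mean_squared_output_length(depth=20, var_scale=var_scale))
```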
Researcher Affiliation | Academia | Boris Hanin, Department of Mathematics, Texas A&M University, College Station, TX, USA (bhanin@math.tamu.edu); David Rolnick, Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA (drolnick@mit.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | Figure 1(b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations... (MNIST [19]). The input image from CIFAR-10 is shown. (CIFAR-10 [17]).
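Both datasets are publicly available. As one illustrative route, not described in the paper, they can be fetched with torchvision's built-in dataset classes:

```python
from torchvision import datasets, transforms

# Illustrative download of the two public datasets cited as MNIST [19] and CIFAR-10 [17];
# the paper does not state how the data were obtained or preprocessed.
to_tensor = transforms.ToTensor()
mnist_train = datasets.MNIST(root="data", train=True, download=True, transform=to_tensor)
mnist_test = datasets.MNIST(root="data", train=False, download=True, transform=to_tensor)
cifar_train = datasets.CIFAR10(root="data", train=True, download=True, transform=to_tensor)
cifar_test = datasets.CIFAR10(root="data", train=False, download=True, transform=to_tensor)
```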
Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 and reporting test accuracy, but it does not specify the training, validation, and test splits (e.g., percentages or sample counts for each).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions the "Keras deep learning Python library [4]" and "PyTorch [20]" but does not specify version numbers for these software dependencies, nor any other key software components with versions.
Experiment Setup | Yes | Datapoints in (a) represent the statistics over random unit inputs for 1,000 independently initialized networks, while (b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations, where networks were trained using stochastic gradient descent with a fixed learning rate of 0.01 and batch size of 1024, for up to 100 epochs.
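The quoted setup pins down the optimizer, learning rate, batch size, epoch budget, and stopping criterion. The sketch below assembles those pieces; the fully connected architecture, zero-initialized biases, and the critical 2/fan-in weight variance are assumptions consistent with the quoted Figure 1 description, not a reproduction of the authors' exact code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_net(depth, width=None, n_in=784, n_out=10):
    """Fully connected ReLU net; width defaults to depth, as in the quoted Figure 1 setup."""
    width = width or depth
    dims = [n_in] + [width] * depth + [n_out]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        lin = nn.Linear(d_in, d_out)
        nn.init.normal_(lin.weight, std=(2.0 / d_in) ** 0.5)  # variance 2/fan-in (assumed)
        nn.init.zeros_(lin.bias)                              # zero biases (assumed)
        layers += [lin, nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the ReLU after the output layer

def epochs_to_20pct(depth, max_epochs=100):
    """SGD, lr 0.01, batch size 1024, up to 100 epochs; stop at 20% test accuracy."""
    flatten = transforms.Compose([transforms.ToTensor(),
                                  transforms.Lambda(lambda t: t.view(-1))])  # vectorized MNIST
    train = datasets.MNIST("data", train=True, download=True, transform=flatten)
    test = datasets.MNIST("data", train=False, download=True, transform=flatten)
    train_loader = DataLoader(train, batch_size=1024, shuffle=True)
    test_loader = DataLoader(test, batch_size=1024)

    net = make_net(depth)
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(1, max_epochs + 1):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
        with torch.no_grad():
            correct = sum((net(x).argmax(1) == y).sum().item() for x, y in test_loader)
        if correct / len(test) >= 0.20:
            return epoch
    return None  # never reached 20% within the epoch budget
```

Averaging `epochs_to_20pct(depth)` over several independent runs per depth would mirror the 5-run averages reported for Figure 1(b).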