How to Start Training: The Effect of Initialization and Architecture
Authors: Boris Hanin, David Rolnick
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained. In Figure 1, we compare the effects of different initializations in networks with varying depth, where the width is equal to the depth (this is done to prevent FM2, see 3.3). Figure 1(a) shows that, as predicted, initializations for which the variance of weights is smaller than the critical value of 2/fan-in lead to a dramatic decrease in output length, while variance larger than this value causes the output length to explode. Figure 1(b) compares the ability of differently initialized networks to start training; it shows the average number of epochs required to achieve 20% test accuracy on MNIST [19]. |
| Researcher Affiliation | Academia | Boris Hanin, Department of Mathematics, Texas A&M University, College Station, TX, USA, bhanin@math.tamu.edu; David Rolnick, Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA, drolnick@mit.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Figure 1(b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations... (MNIST [19]). The input image from CIFAR-10 is shown. (CIFAR-10 [17]). |
| Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 and achieving test accuracy, but it does not specify the training, validation, and test splits (e.g., percentages or sample counts for each). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions the "Keras deep learning Python library [4]" and "PyTorch [20]" but does not specify version numbers for these software dependencies, nor any other key software components with versions. |
| Experiment Setup | Yes | Datapoints in (a) represent the statistics over random unit inputs for 1,000 independently initialized networks, while (b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations, where networks were trained using stochastic gradient descent with a fixed learning rate of 0.01 and batch size of 1024, for up to 100 epochs. |
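The paper's central empirical claim, quoted above, is that weight variance below the critical value 2/fan-in makes output length collapse with depth, while variance above it makes output length explode. That behavior can be illustrated with a minimal NumPy sketch (not the authors' code; the function name, depth, and width below are illustrative choices) that propagates a unit-length input through a deep ReLU network under different variance scales:

```python
import numpy as np

def output_length(depth, width, variance_scale, seed=0):
    """Propagate a random unit-length input through a fully connected
    ReLU network whose i.i.d. weights have variance
    variance_scale / fan_in, and return the L2 norm of the output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)  # unit-length input, as in Figure 1(a)
    for _ in range(depth):
        # fan_in = width, so each entry has variance variance_scale / width
        W = rng.standard_normal((width, width)) * np.sqrt(variance_scale / width)
        x = np.maximum(W @ x, 0.0)  # ReLU activation
    return np.linalg.norm(x)

# variance_scale = 2 is the critical value 2/fan-in; smaller values
# shrink the output length with depth, larger values blow it up.
for scale in (1.0, 2.0, 4.0):
    print(f"scale {scale}: output length {output_length(50, 50, scale):.3e}")
```

Since a ReLU layer roughly halves the second moment of its pre-activations, the per-layer norm factor is about sqrt(variance_scale / 2), so only the critical scale of 2 keeps output length stable across 50 layers; this mirrors the vanishing/exploding trend reported for Figure 1(a).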