Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?
Authors: Boris Hanin
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant β, given by the sum of the reciprocals of the hidden layer widths. ... From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos. The main contributions of this work are: 1. We derive new exact formulas for the joint even moments... 2. We prove that the empirical variance of gradients... 3. We prove that, so long as weights and biases... |
| Researcher Affiliation | Academia | Boris Hanin, Department of Mathematics, Texas A&M University, College Station, TX, USA (bhanin@math.tamu.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper does not provide any statement about, or link to, open-source code for its methodology. |
| Open Datasets | Yes | Figure 1: Comparison of early training dynamics on vectorized MNIST for fully connected ReLU nets with various architectures. ... (Figure reprinted with permission from [HR18] with caption modified). |
| Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not specify any hardware details used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | The paper describes network characteristics (ReLU activations, random weight/bias initialization) but does not provide specific hyperparameters like learning rate, batch size, or optimizer settings for training. |
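The Research Type row above quotes the paper's central claim: the empirical variance of squared entries of the input-output Jacobian of a randomly initialized fully connected ReLU net grows exponentially in β, the sum of the reciprocals of the hidden layer widths. The following is a minimal sketch (not code from the paper) of how one might probe that claim numerically; it assumes He-style weight initialization, zero biases, and tracks only the (0, 0) Jacobian entry for simplicity.

```python
# Minimal sketch (illustrative only, not from the paper): estimate the spread
# of squared input-output Jacobian entries for randomly initialized fully
# connected ReLU nets and compare it against beta = sum of 1/width over the
# hidden layers. Assumes He-style initialization and zero biases.
import numpy as np

def jacobian_entry_sq(widths, rng):
    """Sample one random ReLU net and return the squared (0, 0) entry of its
    input-output Jacobian at a random Gaussian input."""
    x = rng.standard_normal(widths[0])
    J = np.eye(widths[0])                    # running Jacobian, shape (n_l, n_0)
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
        pre = W @ x
        mask = (pre > 0).astype(float)       # ReLU derivative
        J = (W * mask[:, None]) @ J
        x = pre * mask
    return J[0, 0] ** 2

def beta(widths):
    # beta = sum of reciprocals of the hidden layer widths
    return sum(1.0 / n for n in widths[1:-1])

rng = np.random.default_rng(0)
# Same depth, different hidden widths -> different beta.
for widths in ([10] + [100] * 8 + [10], [10] + [20] * 8 + [10]):
    samples = np.array([jacobian_entry_sq(widths, rng) for _ in range(2000)])
    print(f"beta={beta(widths):.3f}  mean={samples.mean():.3f}  var={samples.var():.3f}")
```

Under this setup the mean of the squared Jacobian entries stays roughly comparable across the two architectures, while the narrower net (larger β) shows a markedly larger empirical variance, which is the qualitative behavior the paper's theorems describe.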