A Mean Field Theory of Batch Normalization
Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. (A hedged code sketch of this gradient-explosion claim appears below the table.) |
| Researcher Affiliation | Industry | Microsoft Research AI; Google Brain. gregyang@microsoft.com, {jpennin,vinaysrao,jaschasd,schsam}@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | In Fig. 3 (a) we consider networks trained using SGD on MNIST where we observe that networks deeper than about 50 layers are untrainable regardless of batch size. ... in (d) we train the networks on CIFAR10. |
| Dataset Splits | No | The paper does not provide specific dataset split information (e.g., exact percentages or sample counts for training, validation, and test sets) to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library names with versions). |
| Experiment Setup | Yes | Colors show test accuracy for rectified linear networks with batch normalization and γ = 1, β = 0, ϵ = 10⁻³, N = 384, and η = 10⁻⁵B. (a) trained on MNIST for 10 epochs. (b) trained with fixed batch size 1000 and batch statistics computed over sub-batches of size B. (c) trained using RMSProp. (d) trained on CIFAR10 for 50 epochs. (A hedged sketch of this setup appears below the table.) |
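
The gradient-explosion claim quoted in the Research Type row is easy to probe numerically. The sketch below is not the authors' code: it builds a vanilla fully-connected batch-normalized ReLU network in PyTorch and prints the gradient norm of each weight matrix at initialization. Only the width N = 384 and ϵ = 10⁻³ are taken from the paper; the depth, batch size, and the squared-output readout are illustrative assumptions.

```python
# Hedged sketch, not the paper's code: check that gradient norms grow with depth
# in a vanilla batch-normalized ReLU network at initialization.
import torch
import torch.nn as nn

depth, width, batch = 50, 384, 128            # width and eps from the paper; depth, batch assumed
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.BatchNorm1d(width, eps=1e-3), nn.ReLU()]
net = nn.Sequential(*layers)

x = torch.randn(batch, width)                 # i.i.d. Gaussian inputs
net(x).pow(2).mean().backward()               # any scalar readout suffices for this check

# Gradient norm of each weight matrix, listed from the input layer to the output layer.
# The mean field prediction: norms grow roughly exponentially toward the input layer.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        print(f"layer {i // 3:3d}  ||dL/dW|| = {m.weight.grad.norm().item():.3e}")
```

If the theory holds, the printed norms near the input layer should be orders of magnitude larger than those near the output layer.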
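
The Experiment Setup row can likewise be turned into a minimal training script. The following is a hedged reconstruction, not released code (the paper provides none): γ = 1 and β = 0 are PyTorch's BatchNorm1d defaults, and ϵ = 10⁻³, N = 384, SGD, 10 MNIST epochs, and η = 10⁻⁵·B come from the caption, while the depth, the batch size B, and the data-loading details are assumptions.

```python
# Hedged sketch of the reported panel (a) setup: ReLU + BatchNorm MLP trained with SGD on MNIST.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

depth, N, B = 60, 384, 256                              # depth and B illustrative; N from the paper
blocks = [nn.Flatten(), nn.Linear(784, N)]
for _ in range(depth - 1):
    blocks += [nn.BatchNorm1d(N, eps=1e-3), nn.ReLU(), nn.Linear(N, N)]
blocks += [nn.BatchNorm1d(N, eps=1e-3), nn.ReLU(), nn.Linear(N, 10)]
model = nn.Sequential(*blocks)

opt = torch.optim.SGD(model.parameters(), lr=1e-5 * B)  # learning rate eta = 10^-5 * B
train = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train, batch_size=B, shuffle=True)

for epoch in range(10):                                 # the paper trains MNIST for 10 epochs
    for x, y in loader:
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
```

At depths beyond roughly 50 layers this configuration is the regime that the Open Datasets row quotes as untrainable regardless of batch size (Fig. 3a of the paper).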