A Mean Field Theory of Batch Normalization

Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations.
Researcher Affiliation | Industry | Microsoft Research AI; Google Brain. gregyang@microsoft.com, {jpennin,vinaysrao,jaschasd,schsam}@google.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | In Fig. 3(a) we consider networks trained using SGD on MNIST, where we observe that networks deeper than about 50 layers are untrainable regardless of batch size. ... in (d) we train the networks on CIFAR10.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., exact percentages or sample counts for the training, validation, and test sets) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library names with versions).
Experiment Setup | Yes | Colors show test accuracy for rectified linear networks with batch normalization and γ = 1, β = 0, ϵ = 10⁻³, N = 384, and η = 10⁻⁵ B. (a) Trained on MNIST for 10 epochs. (b) Trained with fixed batch size 1000 and batch statistics computed over sub-batches of size B. (c) Trained using RMSProp. (d) Trained on CIFAR10 for 50 epochs.
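
To make the quoted claim and setup concrete: the theory predicts that gradient signals in a vanilla batch-normalized ReLU network grow exponentially with depth at initialization, which is why networks deeper than roughly 50 layers are reported as untrainable. The following minimal Python/JAX sketch (not the authors' code) measures the gradient norm reaching the first layer of such a network at increasing depths. Width N = 384, γ = 1, β = 0, and ϵ = 10⁻³ follow the experiment setup above; the batch size, depths, He-style weight scale, and squared-sum readout loss are illustrative assumptions.

import jax
import jax.numpy as jnp

N, BATCH, EPS = 384, 128, 1e-3  # width and BN epsilon from the paper; batch size is an assumption

def batch_norm(h):
    # Per-feature normalization over the batch, with gamma = 1 and beta = 0.
    return (h - h.mean(axis=0, keepdims=True)) / jnp.sqrt(h.var(axis=0, keepdims=True) + EPS)

def forward(weights, x):
    # Vanilla batch-normalized ReLU MLP: BN is applied to the pre-activations, no skip connections.
    h = x
    for w in weights:
        h = jax.nn.relu(batch_norm(h @ w))
    return h

def first_layer_grad_norm(depth, key):
    keys = jax.random.split(key, depth + 1)
    # He-style initialization scale sqrt(2/N); an illustrative choice, not specified by the quoted setup.
    weights = [jax.random.normal(k, (N, N)) * jnp.sqrt(2.0 / N) for k in keys[:depth]]
    x = jax.random.normal(keys[-1], (BATCH, N))
    # Scalar readout (sum of squared outputs) so jax.grad applies; also an assumption.
    loss = lambda w0: jnp.sum(forward([w0] + weights[1:], x) ** 2)
    return float(jnp.linalg.norm(jax.grad(loss)(weights[0])))

key = jax.random.PRNGKey(0)
for depth in (5, 10, 20, 40):
    print(f"depth {depth:3d}: first-layer gradient norm {first_layer_grad_norm(depth, key):.3e}")

Under these assumptions, the printed first-layer gradient norms should grow rapidly with depth, consistent with the paper's prediction of gradient explosion in deep batch-normalized networks without skip connections.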