Understanding Batch Normalization

Authors: Nils Bjorck, Carla P. Gomes, Bart Selman, Kilian Q. Weinberger

NeurIPS 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct several experiments, and show that BN primarily enables training with larger learning rates, which is the cause for faster convergence and better generalization." (Abstract) and "To investigate batch normalization we will use an experimental setup similar to the original Resnet paper [17]: image classification on CIFAR10 [27] with a 110 layer Resnet." (Section 1.2) The BN transform is restated below for reference. |
| Researcher Affiliation | Academia | "Johan Bjorck, Carla Gomes, Bart Selman, Kilian Q. Weinberger, Cornell University, {njb225,gomes,selman,kqw4}@cornell.edu" |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers readers to an online version on arXiv ([4]) for further details, which typically contains the paper itself, not source code. There is no explicit statement about releasing code and no direct link to a code repository for the methodology. |
| Open Datasets | Yes | "To investigate batch normalization we will use an experimental setup similar to the original Resnet paper [17]: image classification on CIFAR10 [27]" |
| Dataset Splits | No | The paper mentions training on CIFAR10 with various initial learning rates, but it does not explicitly state the use of a validation set or describe validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions general techniques and tools such as "SGD" and "data augmentation" but does not specify any software dependencies with version numbers (e.g., library names with specific versions). |
| Experiment Setup | Yes | "We use SGD with momentum and weight decay, employ standard data augmentation and image preprocessing techniques and decrease learning rate when learning plateaus, all as in [17] and with the same parameter values. The original network can be trained with initial learning rate 0.1 over 165 epochs; however, this fails without BN. We always report the best results among initial learning rates from {0.1, 0.003, 0.001, 0.0003, 0.0001, 0.00003} and use enough epochs such that learning plateaus." A code sketch of this protocol follows the table. |
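For reference (as flagged in the Research Type row), batch normalization standardizes each activation with mini-batch statistics and then applies a learned affine transform. The following is a standard statement of the transform in our own notation, not a formula quoted from the paper:

```latex
% Batch normalization of activation x_i over a mini-batch B (standard form).
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \, \hat{x}_i + \beta
```

Here μ_B and σ_B² are the mini-batch mean and variance, ε is a small constant for numerical stability, and γ, β are learned scale and shift parameters. The paper's experiments study how inserting this operation changes the range of initial learning rates at which the 110-layer ResNet can be trained.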
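The Experiment Setup row pins the protocol down well enough to sketch it in code. The sketch below is not the authors' code (none is linked); it assumes PyTorch/torchvision, batch size 128, momentum 0.9 and weight decay 1e-4 (the values of [17], which the paper says it reuses), and a tiny stand-in network in place of the 110-layer ResNet. The names `make_loader`, `make_model`, `train`, and the `use_bn` flag are introduced here for illustration only.

```python
# Hedged sketch of the training protocol quoted in the Experiment Setup row.
# Assumptions not stated in the paper: PyTorch/torchvision, batch size 128,
# momentum 0.9 / weight decay 1e-4 (ResNet-paper values [17]), and a small
# stand-in model instead of the actual 110-layer ResNet.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


def make_loader(train=True, batch_size=128):
    # Standard CIFAR-10 augmentation: pad-and-crop plus horizontal flip.
    aug = [T.RandomCrop(32, padding=4), T.RandomHorizontalFlip()] if train else []
    transform = T.Compose(aug + [
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    data = torchvision.datasets.CIFAR10(root="./data", train=train,
                                        download=True, transform=transform)
    return torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=train)


def make_model(use_bn=True):
    # Hypothetical stand-in for the 110-layer ResNet of [17]; a real reproduction
    # would substitute that architecture, with its BatchNorm layers kept or removed.
    norm = nn.BatchNorm2d if use_bn else nn.Identity
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), norm(16), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1, stride=2), norm(32), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
    )


def train(model, initial_lr, epochs=165):
    # SGD with momentum and weight decay; drop the learning rate when the
    # training loss plateaus ("decrease learning rate when learning plateaus").
    opt = torch.optim.SGD(model.parameters(), lr=initial_lr,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
    loss_fn = nn.CrossEntropyLoss()
    loader = make_loader(train=True)
    for _ in range(epochs):
        total = 0.0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()
        sched.step(total / len(loader))
    return model


# Sweep the initial learning rates listed in the paper; the best result is reported.
for lr in [0.1, 0.003, 0.001, 0.0003, 0.0001, 0.00003]:
    train(make_model(use_bn=False), initial_lr=lr)
```

Because the paper names no framework, batch size, or plateau criterion, those choices above are placeholders; only the optimizer family, the augmentation style, the learning-rate sweep, and the 165-epoch budget come from the quoted text.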