Understanding Batch Normalization
Authors: Nils Bjorck, Carla P. Gomes, Bart Selman, Kilian Q. Weinberger
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct several experiments, and show that BN primarily enables training with larger learning rates, which is the cause for faster convergence and better generalization. (Abstract) and To investigate batch normalization we will use an experimental setup similar to the original Resnet paper [17]: image classification on CIFAR10 [27] with a 110 layer Resnet. (Section 1.2) |
| Researcher Affiliation | Academia | Johan Bjorck, Carla Gomes, Bart Selman, Kilian Q. Weinberger; Cornell University; {njb225,gomes,selman,kqw4}@cornell.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper points readers to an extended version on arXiv ([4]) for further details; that is the paper itself, not source code. There is no explicit statement about releasing code and no link to a code repository for the methodology. |
| Open Datasets | Yes | To investigate batch normalization we will use an experimental setup similar to the original Resnet paper [17]: image classification on CIFAR10 [27] |
| Dataset Splits | No | The paper mentions training on CIFAR10 and using various initial learning rates, but does not explicitly state the use of a validation set or describe validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions general techniques and tools like "SGD" and "data augmentation" but does not specify any software dependencies with version numbers (e.g., library names with specific versions). |
| Experiment Setup | Yes | We use SGD with momentum and weight decay, employ standard data augmentation and image preprocessing techniques and decrease learning rate when learning plateaus, all as in [17] and with the same parameter values. The original network can be trained with initial learning rate 0.1 over 165 epochs, which, however, fails without BN. We always report the best results among initial learning rates from {0.1, 0.003, 0.001, 0.0003, 0.0001, 0.00003} and use enough epochs such that learning plateaus. (See the sketches after this table.) |
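
The data pipeline referenced above (CIFAR10 with the "standard data augmentation and image preprocessing techniques" of the original ResNet setup [17]) is not restated in detail in the paper. Below is a minimal sketch of one common reading of that setup, assuming pad-4 random cropping, horizontal flips, and per-channel normalization; these specific values and the batch sizes are assumptions, not quotes from the paper.

```python
# Hypothetical CIFAR-10 data pipeline sketch. The transform choices below
# (pad-4 random crop, horizontal flip, normalization constants) are assumed,
# following common practice for the ResNet CIFAR setup the paper cites [17].
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)  # commonly used statistics (assumed)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad-and-crop augmentation
    transforms.RandomHorizontalFlip(),      # mirror augmentation
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_transform)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=test_transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)
```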
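
The training protocol from the Experiment Setup row (SGD with momentum and weight decay, a learning-rate decrease when learning plateaus, and reporting the best result across the listed initial learning rates) could be sketched as follows. The momentum and weight-decay values, the plateau criterion, and the use of a torchvision ResNet-18 in place of the paper's 110-layer CIFAR ResNet are all assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical training-protocol sketch: SGD with momentum and weight decay,
# plateau-based learning-rate decay, and a sweep over initial learning rates.
# ResNet-18 (with BatchNorm layers) stands in for the paper's 110-layer ResNet.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def train_one_setting(lr, train_loader, test_loader, epochs=165, device="cpu"):
    model = resnet18(num_classes=10).to(device)   # stand-in model (assumption)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)  # assumed values
    # Decrease the learning rate when the evaluation loss plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

    best_acc = 0.0
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # Evaluate, track accuracy, and drive the plateau scheduler.
        model.eval()
        correct, total, eval_loss = 0, 0, 0.0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                logits = model(images)
                eval_loss += criterion(logits, labels).item()
                correct += (logits.argmax(dim=1) == labels).sum().item()
                total += labels.size(0)
        scheduler.step(eval_loss / max(len(test_loader), 1))
        best_acc = max(best_acc, correct / max(total, 1))
    return best_acc

# Report the best result among the initial learning rates listed in the table.
initial_lrs = [0.1, 0.003, 0.001, 0.0003, 0.0001, 0.00003]
# results = {lr: train_one_setting(lr, train_loader, test_loader) for lr in initial_lrs}
# print(max(results.items(), key=lambda kv: kv[1]))
```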