Towards Understanding Regularization in Batch Normalization

Authors: Ping Luo, Xinjiang Wang, Wenqi Shao, Zhanglin Peng

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that BN in convolutional neural networks shares the same traits of regularization as the above analyses.
Researcher Affiliation | Collaboration | The Chinese University of Hong Kong, SenseTime Research, The University of Hong Kong
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in the paper.
Open Datasets | Yes | "We adopt CIFAR10 (Krizhevsky, 2009) that contains 60k images of 10 categories (50k images for training and 10k images for test)." (A data-loading sketch follows this table.)
Dataset Splits | No | The paper explicitly states "50k images for training and 10k images for test" for CIFAR10, but does not describe a validation split (e.g., specific percentages or counts).
Hardware Specification | No | The paper mentions training "on 8 GPUs" but does not specify the GPU models, CPU models, or other hardware details used for its experiments.
Software Dependencies | No | The paper mentions using "SGD with momentum" but does not list software dependencies such as library or solver names with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | All models are trained by using SGD with momentum, while the initial learning rates are scaled proportionally (Goyal et al., 2017) when different batch sizes are presented; more empirical settings can be found in Appendix B. For example, a momentum value of 0.9 is used, with the learning rate continuously decayed by a factor of 10^-4 each step. For different batch sizes, the initial learning rate is scaled proportionally with the batch size to maintain similar learning dynamics (Goyal et al., 2017). "... the initial learning rate is 0.1, which is then decayed by a factor of 10 after 30, 60, and 90 training epochs ... add a dropout after each BN layer ... with ratio 0.1 ... with ratio 0.2 ..." (A training-schedule sketch follows this table.)