Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Authors: Junjie Yan, Ruosi Wan, Xiangyu Zhang, Wei Zhang, Yichen Wei, Jian Sun

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove the benefits of MABN by both theoretical analysis and experiments. Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO.
Researcher Affiliation | Collaboration | Junjie Yan (1,2), Ruosi Wan (3), Xiangyu Zhang (3), Wei Zhang (1,2), Yichen Wei (3), Jian Sun (3); 1 Shanghai Key Laboratory of Intelligent Information Processing; 2 School of Computer Science, Fudan University; 3 Megvii Technology.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code has been released at https://github.com/megvii-model/MABN.
Open Datasets | Yes | MABN shows its effectiveness on multiple public vision datasets and tasks, including ImageNet (Russakovsky et al., 2015) and COCO (Lin et al., 2014).
Dataset Splits | Yes | The models are evaluated by top-1 classification error on center crops of 224 × 224 pixels from the validation set.
Hardware Specification | No | The paper mentions running experiments 'across 8 GPUs' but does not specify the GPU model or any other hardware components.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | All experiments on ImageNet are conducted across 8 GPUs. We train models with a gradient batch size of Bg = 32 images per GPU. All weights from convolutions are initialized as in He et al. (2015). We use 1 to initialize all γ and 0 to initialize all β in normalization layers. We use a weight decay of 10^-4 for all weight layers including γ and β (following Wu & He (2018)). We train for 600,000 iterations (approximately 120 epochs at a gradient batch size of 256) and divide the learning rate by 10 at 150,000, 300,000 and 450,000 iterations. The data augmentation follows Gross & Wilber (2016). In vanilla BN or BRN the momentum is α = 0.9; in MABN the momentum is α = 0.98.
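As a reading aid for the Experiment Setup row, below is a minimal PyTorch-style sketch of the reported ImageNet schedule. Only the per-GPU batch size, initialization, weight decay, total iterations, and decay milestones come from the quoted text; the base learning rate, SGD momentum, the ResNet-50 backbone, and the MultiStepLR scheduler are assumptions added for illustration, not the authors' released code.

```python
# Hypothetical sketch of the ImageNet schedule quoted above.
# Assumed (not stated in the quote): base learning rate 0.1, SGD momentum 0.9,
# torchvision ResNet-50, and the MultiStepLR scheduler.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()

# Convolutions initialized as in He et al. (2015); gamma = 1, beta = 0 in normalization layers.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)   # gamma
        nn.init.zeros_(m.bias)    # beta

# Weight decay 1e-4 on all weight layers, including gamma and beta
# (hence a single parameter group).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# 600,000 iterations total (about 120 epochs at a gradient batch size of 256,
# i.e. 32 images per GPU across 8 GPUs); learning rate divided by 10 at
# 150k, 300k, and 450k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000, 300_000, 450_000], gamma=0.1
)
```

Under this iteration-based schedule, scheduler.step() would be called once per training iteration rather than once per epoch.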
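The momentum values in the last row (α = 0.9 for vanilla BN/BRN, α = 0.98 for MABN) control an exponential moving average over batch statistics; per the paper's title, MABN uses such moving-average statistics to stabilize both forward and backward propagation. The following is only a minimal sketch of what the momentum α does, with illustrative names, assuming a standard EMA update; it is not the released MABN implementation.

```python
import torch

def ema_update(running_mean, running_var, batch_mean, batch_var, alpha=0.98):
    """Exponential moving average of batch statistics.

    A larger alpha (e.g. 0.98 for MABN vs. 0.9 for BN/BRN) retains more history,
    which smooths the statistics when the per-GPU batch is small.
    """
    running_mean = alpha * running_mean + (1.0 - alpha) * batch_mean
    running_var = alpha * running_var + (1.0 - alpha) * batch_var
    return running_mean, running_var

# Example: statistics for a feature map with 64 channels.
running_mean, running_var = torch.zeros(64), torch.ones(64)
batch_mean, batch_var = torch.randn(64), torch.rand(64) + 0.5
running_mean, running_var = ema_update(running_mean, running_var, batch_mean, batch_var)
```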