Deconstructing the Regularization of BatchNorm
Authors: Yann Dauphin, Ekin Dogus Cubuk
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study aims to decompose Batch Norm into separate mechanisms that are much simpler. We identify three effects of Batch Norm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer at initialization and during training can recover a large part of Batch Norm's generalization boost. This regularization mechanism can lift accuracy by 2.9% for Resnet-50 on Imagenet without Batch Norm. |
| Researcher Affiliation | Industry | Yann N. Dauphin, Google Research, ynd@google.com; Ekin D. Cubuk, Google Research, cubuk@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing code or links to a code repository. |
| Open Datasets | Yes | CIFAR-10: CIFAR-10 has 50k samples for training and 10k samples for test evaluation. We tune hyperparameters using 5k of the training samples as a validation set. We then train a final model using the whole training set of 50,000 samples with the best hyperparameters for 10 different seeds, and report the median test accuracy. Each model is trained on a single GPU. The robustness is reported as the accuracy on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019), averaged over all corruptions and their severities. SVHN: SVHN has 73,257 samples for training and 26,032 samples for testing (note that we do not consider the 531,131 extra samples). We tune hyperparameters using 3,257 of the training samples as a validation set. We then train a final model using the whole training set of 73,257 samples with the best hyperparameters and report the median test accuracy for 10 different random initializations. Imagenet: Imagenet is a large-scale image recognition dataset with over 1.2M examples and 1000 classes. Following past work, we report classification accuracy on the validation set of 50k examples. |
| Dataset Splits | Yes | CIFAR-10: CIFAR-10 has 50k samples for training and 10k samples for test evaluation. We tune hyperparameters using 5k of the training samples as a validation set. We then train a final model using the whole training set of 50,000 samples with the best hyperparameters for 10 different seeds, and report the median test accuracy. Each model is trained on a single GPU. The robustness is reported as the accuracy on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019), averaged over all corruptions and their severities. SVHN: SVHN has 73,257 samples for training and 26,032 samples for testing (note that we do not consider the 531,131 extra samples). We tune hyperparameters using 3,257 of the training samples as a validation set. We then train a final model using the whole training set of 73,257 samples with the best hyperparameters and report the median test accuracy for 10 different random initializations. Imagenet: Imagenet is a large-scale image recognition dataset with over 1.2M examples and 1000 classes. Following past work, we report classification accuracy on the validation set of 50k examples. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper mentions that "Each model is trained on a single GPU" for the CIFAR-10 experiments but does not specify the GPU model, CPU, or any other hardware details. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Wide Resnet: We train the Wide Resnet-28-10 architecture (Zagoruyko and Komodakis, 2016) on CIFAR-10, which is one of the most commonly used architectures for this dataset. We train the model for 200 epochs using a cosine learning rate decay with a batch size of 128. We use the standard data augmentation of horizontal flips and pad-and-crop. For models without Batch Norm, we use Fixup initialization. We use the activation function Swish (Ramachandran et al., 2017), with the β value initialized to 0. Resnet-50: Resnet-50 (He et al., 2016) has quickly become a standard architecture to evaluate on Imagenet. Our implementation makes use of the improvements proposed by Goyal et al. (2017). Only the networks trained without Batch Norm use Fixup initialization (Zhang et al., 2019). In order to improve results for the standardizing loss, we found it was useful to use a special case of Fixup where the residual branches are not initialized to zero: when the standardizing loss is used we initialize the last layer of each residual block inversely proportional to the square root of the depth. The models are trained for 90 epochs with a batch size of 512 and float32 precision. All results are the average of two seeds. The coefficient for the standardizing loss is cross-validated logarithmically in the range 10⁻⁷ to 10⁻⁵. We found that higher coefficients led to divergence at high learning rates. The coefficient for embedding and functional L2 was searched in the range 0 to 1 with increments of 0.1. Efficientnet: Efficientnet (Tan and Le, 2019) is the state-of-the-art architecture on Imagenet at the time of writing. We evaluate the largest version trained with RandAugment (Cubuk et al., 2019b) data augmentation and without additional data, called the B8 (Xie et al., 2020). We follow the implementation of Tan and Le (2019) and the hyper-parameters they found optimal. We use early stopping based on a held-out validation set of 25,022 images. All results reported are averaged over two random seeds. (Sketches of the hyperparameter grids, the cosine schedule, and an illustrative final-layer penalty follow the table.) |
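
The dataset-split rows above describe held-out validation sets carved from the official training sets. Below is a minimal sketch of those splits, assuming torchvision datasets and a seeded random hold-out; the paper does not say how the validation indices were chosen, so the split strategy and the seed here are assumptions.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets

# CIFAR-10: 50k train / 10k test; hold out 5k training samples for validation.
cifar_train_full = datasets.CIFAR10("data/", train=True, download=True)
cifar_test = datasets.CIFAR10("data/", train=False, download=True)
cifar_train, cifar_val = random_split(
    cifar_train_full, [45_000, 5_000],
    generator=torch.Generator().manual_seed(0),  # seed is an assumption
)

# SVHN: 73,257 train / 26,032 test (the 531,131 "extra" samples are not used);
# hold out 3,257 training samples for validation.
svhn_train_full = datasets.SVHN("data/", split="train", download=True)
svhn_test = datasets.SVHN("data/", split="test", download=True)
svhn_train, svhn_val = random_split(
    svhn_train_full, [70_000, 3_257],
    generator=torch.Generator().manual_seed(0),
)
```

Imagenet needs no extra split here: the report quotes accuracy on the standard validation set of 50k examples.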
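
The Experiment Setup row quotes concrete search ranges and a cosine learning-rate schedule. The sketch below shows one way those could be instantiated; the number of logarithmic grid points, the optimizer, the momentum, and the base learning rate are assumptions not stated in the table.

```python
import numpy as np
import torch

# Standardizing-loss coefficient: cross-validated logarithmically in [1e-7, 1e-5]
# (using 5 grid points is an assumption).
std_loss_coeffs = np.logspace(-7, -5, num=5)

# Embedding / functional L2 coefficient: 0 to 1 in increments of 0.1.
l2_coeffs = np.round(np.arange(0.0, 1.0 + 1e-9, 0.1), 1)

# Cosine learning-rate decay over the 200 WRN-28-10 epochs
# (SGD with lr=0.1 and momentum=0.9 are placeholder assumptions).
model = torch.nn.Linear(10, 10)  # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one training epoch with batch size 128 would run here ...
    scheduler.step()
```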
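
The Research Type row summarizes the paper's claim that suppressing explosive growth at the final layer recovers much of Batch Norm's regularization benefit, and the setup row mentions a standardizing loss with a small coefficient. The sketch below is only a generic auxiliary L2 penalty on the final-layer pre-activations to make that idea concrete; it is not the paper's exact standardizing loss, and the coefficient is just one value inside the quoted search range.

```python
import torch
import torch.nn.functional as F

def penalized_loss(logits, targets, coeff):
    """Cross-entropy plus an auxiliary L2 penalty on the final-layer
    pre-activations. Generic illustration only; the paper's standardizing
    loss may take a different exact form."""
    ce = F.cross_entropy(logits, targets)
    penalty = (logits ** 2).mean()  # discourages explosive growth of the logits
    return ce + coeff * penalty

# Usage sketch with dummy data (1000 classes, as on Imagenet).
logits = torch.randn(8, 1000, requires_grad=True)
targets = torch.randint(0, 1000, (8,))
loss = penalized_loss(logits, targets, coeff=1e-6)  # coeff within the searched 1e-7..1e-5 range
loss.backward()
```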