Deconstructing the Regularization of BatchNorm
Authors: Yann Dauphin, Ekin Dogus Cubuk
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study aims to decompose Batch Norm into separate mechanisms that are much simpler. We identify three effects of Batch Norm and assess their impact directly with ablations and interventions. Our experiments show that preventing explosive growth at the final layer at initialization and during training can recover a large part of Batch Norm's generalization boost. This regularization mechanism can lift accuracy by 2.9% for Resnet-50 on Imagenet without Batch Norm. |
| Researcher Affiliation | Industry | Yann N. Dauphin, Google Research, ynd@google.com; Ekin D. Cubuk, Google Research, cubuk@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing code or links to a code repository. |
| Open Datasets | Yes | CIFAR-10: CIFAR-10 has 50k samples for training and 10k samples for test evaluation. We tune hyperparameters using 5k of the training samples as a validation set. We then train a final model using the whole training set of 50,000 samples with the best hyperparameters for 10 different seeds, and report the median test accuracy. Each model is trained on a single GPU. The robustness is reported as the accuracy on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019), averaged over all corruptions and their severities. SVHN: SVHN has 73,257 samples for training and 26,032 samples for testing (note that we do not consider the 531,131 extra samples). We tune hyperparameters using 3,257 of the training samples as a validation set. We then train a final model using the whole training set of 73,257 samples with the best hyperparameters and report the median test accuracy for 10 different random initializations. Imagenet: Imagenet is a large-scale image recognition dataset with over 1.2M examples and 1000 classes. Following past work, we report classification accuracy on the validation set of 50k examples. |
| Dataset Splits | Yes | CIFAR-10: CIFAR-10 has 50k samples for training and 10k samples for test evaluation. We tune hyperparameters using 5k of the training samples as a validation set. We then train a final model using the whole training set of 50,000 samples with the best hyperparameters for 10 different seeds, and report the median test accuracy. Each model is trained on a single GPU. The robustness is reported as the accuracy on the CIFAR-10-C dataset (Hendrycks and Dietterich, 2019), averaged over all corruptions and their severities. SVHN: SVHN has 73,257 samples for training and 26,032 samples for testing (note that we do not consider the 531,131 extra samples). We tune hyperparameters using 3,257 of the training samples as a validation set. We then train a final model using the whole training set of 73,257 samples with the best hyperparameters and report the median test accuracy for 10 different random initializations. Imagenet: Imagenet is a large-scale image recognition dataset with over 1.2M examples and 1000 classes. Following past work, we report classification accuracy on the validation set of 50k examples. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper mentions that "Each model is trained on a single GPU" for the CIFAR-10 experiments but does not specify the GPU model, CPU, or any other hardware details. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Wide Resnet: We train the Wide Resnet-28-10 architecture (Zagoruyko and Komodakis, 2016) on CIFAR-10, which is one of the most commonly used architectures for this dataset. We train the model for 200 epochs using a cosine learning rate decay with a batch size of 128. We use the standard data augmentation of horizontal flips and pad-and-crop. For models without Batch Norm, we use Fixup initialization. We use the activation function Swish (Ramachandran et al., 2017), with the β value initialized to 0. Resnet-50: Resnet-50 (He et al., 2016) has quickly become a standard architecture to evaluate on Imagenet. Our implementation makes use of the improvements proposed by Goyal et al. (2017). Only the networks trained without Batch Norm use Fixup initialization (Zhang et al., 2019). In order to improve results for the standardizing loss, we found it was useful to use a special case of Fixup where the residual branches are not initialized to zero: when the standardizing loss is used we initialize the last layer of each residual block inversely proportional to the square root of the depth. The models are trained for 90 epochs with a batch size of 512 and float32 precision. All results are the average of two seeds. The coefficient for the standardizing loss is cross-validated logarithmically in the range 10⁻⁷ to 10⁻⁵. We found that higher coefficients led to divergence at high learning rates. The coefficient for embedding and functional L2 was searched in the range 0 to 1 with increments of 0.1. Efficientnet: Efficientnet (Tan and Le, 2019) is the state-of-the-art architecture on Imagenet at the time of writing. We evaluate the largest version trained with RandAugment (Cubuk et al., 2019b) data augmentation and without additional data, called the B8 (Xie et al., 2020). We follow the implementation of Tan and Le (2019) and the hyper-parameters they found optimal. We use early stopping based on a held-out validation set of 25,022 images. All results reported are averaged over two random seeds. (Sketches of the hyperparameter grids, the cosine schedule, and an illustrative final-layer penalty follow the table.) |
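
The dataset-split rows above describe held-out validation sets carved from the official training sets. Below is a minimal sketch of those splits, assuming torchvision datasets and a seeded random hold-out; the paper does not say how the validation indices were chosen, so the split strategy and the seed here are assumptions.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets

# CIFAR-10: 50k train / 10k test; hold out 5k training samples for validation.
cifar_train_full = datasets.CIFAR10("data/", train=True, download=True)
cifar_test = datasets.CIFAR10("data/", train=False, download=True)
cifar_train, cifar_val = random_split(
    cifar_train_full, [45_000, 5_000],
    generator=torch.Generator().manual_seed(0),  # seed is an assumption
)

# SVHN: 73,257 train / 26,032 test (the 531,131 "extra" samples are not used);
# hold out 3,257 training samples for validation.
svhn_train_full = datasets.SVHN("data/", split="train", download=True)
svhn_test = datasets.SVHN("data/", split="test", download=True)
svhn_train, svhn_val = random_split(
    svhn_train_full, [70_000, 3_257],
    generator=torch.Generator().manual_seed(0),
)
```

Imagenet needs no extra split here: the report quotes accuracy on the standard validation set of 50k examples.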
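
The Experiment Setup row quotes concrete search ranges and a cosine learning-rate schedule. The sketch below shows one way those could be instantiated; the number of logarithmic grid points, the optimizer, the momentum, and the base learning rate are assumptions not stated in the table.

```python
import numpy as np
import torch

# Standardizing-loss coefficient: cross-validated logarithmically in [1e-7, 1e-5]
# (using 5 grid points is an assumption).
std_loss_coeffs = np.logspace(-7, -5, num=5)

# Embedding / functional L2 coefficient: 0 to 1 in increments of 0.1.
l2_coeffs = np.round(np.arange(0.0, 1.0 + 1e-9, 0.1), 1)

# Cosine learning-rate decay over the 200 WRN-28-10 epochs
# (SGD with lr=0.1 and momentum=0.9 are placeholder assumptions).
model = torch.nn.Linear(10, 10)  # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one training epoch with batch size 128 would run here ...
    scheduler.step()
```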
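
The Research Type row summarizes the paper's claim that suppressing explosive growth at the final layer recovers much of Batch Norm's regularization benefit, and the setup row mentions a standardizing loss with a small coefficient. The sketch below is only a generic auxiliary L2 penalty on the final-layer pre-activations to make that idea concrete; it is not the paper's exact standardizing loss, and the coefficient is just one value inside the quoted search range.

```python
import torch
import torch.nn.functional as F

def penalized_loss(logits, targets, coeff):
    """Cross-entropy plus an auxiliary L2 penalty on the final-layer
    pre-activations. Generic illustration only; the paper's standardizing
    loss may take a different exact form."""
    ce = F.cross_entropy(logits, targets)
    penalty = (logits ** 2).mean()  # discourages explosive growth of the logits
    return ce + coeff * penalty

# Usage sketch with dummy data (1000 classes, as on Imagenet).
logits = torch.randn(8, 1000, requires_grad=True)
targets = torch.randint(0, 1000, (8,))
loss = penalized_loss(logits, targets, coeff=1e-6)  # coeff within the searched 1e-7..1e-5 range
loss.backward()
```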