Is normalization indispensable for training deep neural networks?
Authors: Jie Shao, Kai Hu, Changhu Wang, Xiangyang Xue, Bhiksha Raj
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on a wide range of tasks. On ImageNet, our un-normalized RescaleNet models can achieve the same or slightly better performance than the corresponding normalized models (ResNet, ResNeXt) with the same training settings. Our un-normalized RescaleNet variant on ResNet50 has 0.3% lower error than its BN/GN ResNet50 counterpart. Our method can also apply to conventional non-residual networks. Our 19-layer VGG [30] model without normalization achieves a top-1 validation error rate of 25.0%, which is 2.6% lower than PyTorch's pre-trained model [26]. Our method also shows consistent improvement on Mask R-CNN for COCO object detection and segmentation [20], 3D convolutional networks for Kinetics video classification [18], and deep transformers for WMT English-German machine translation [34]. In cases where normalization operations may cause problems, our method can be a competitive alternative. See also Section 5 (Experiments). |
| Researcher Affiliation | Collaboration | 1Fudan University, Shanghai, China 2Carnegie Mellon University, Pittsburgh, PA 3Byte Dance AI Lab, Shanghai, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/hukkai/rescaling. |
| Open Datasets | Yes | We experiment in the ImageNet classification dataset [8]. The dataset contains 128k training images and 50k validation images that are labeled with 1000 categories. and Our method also shows consistent improvement on Mask R-CNN for COCO object detection and segmentation [20], 3D convolutional networks for Kinetics video classification [18], and deep transformers for WMT English-German machine translation [34]. |
| Dataset Splits | Yes | The dataset contains 128k training images and 50k validation images that are labeled with 1000 categories. and trained in the COCO train2017 set and evaluated on the COCO val2017 set. |
| Hardware Specification | No | The paper mentions using GPUs for training (e.g., '8 GPUs, 2 images per GPU' for COCO), but it does not provide specific hardware details such as the models of GPUs or CPUs, or any detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of 'PyTorch implementations' and the 'fairseq library', but it does not specify any version numbers for these or any other software components. |
| Experiment Setup | Yes | During training, we adopt random resized crop with a 224×224 crop size, and random horizontal flip for data augmentation. We use SGD to train the models for 100 epochs. We use a weight decay of 0.0001 for all weight layers, and no weight decay for the bias and multipliers. We report the top-1 classification error on the 224×224 center-crop in the validation set. All results are averaged over 5 runs. The default setting is to train the model with a batch size of 256 and an initial learning rate of 0.1. The learning rate is decreased at 30, 60, 90 epochs. |
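The training recipe quoted in the Experiment Setup row can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the tiny stand-in model, the momentum value of 0.9, and the decay factor of 0.1 at each milestone are assumptions (the paper states only the initial learning rate, the weight-decay rule, and the decay epochs). The key detail it reproduces is that weight decay of 1e-4 applies to weight tensors but not to biases or scalar multipliers.

```python
import torch
import torch.nn as nn

# Stand-in model; the paper trains ResNet/ResNeXt-style networks on ImageNet.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

# Split parameters: weight decay for weight layers only, none for
# biases (and, in the real models, the scalar multipliers).
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,          # initial learning rate from the paper
    momentum=0.9,    # assumed; standard for ImageNet training
)

# Learning rate decreased at epochs 30, 60, 90 (gamma=0.1 assumed).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1
)

for epoch in range(100):
    optimizer.step()   # placeholder for one epoch of training on batch size 256
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]
```

After 100 epochs with three decays, the learning rate ends at 0.1 × 0.1³ = 1e-4.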