Understanding the Disharmony between Weight Normalization Family and Weight Decay

Authors: Xiang Li, Shuo Chen, Jian Yang

AAAI 2020 (pp. 4715–4722) | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The effectiveness of our method is demonstrated by experiments on the large-scale ImageNet dataset." "Table 1 shows the impact of λ on network training with two widely used optimizers SGD (Sutskever et al. 2013) with momentum and Adam (Kingma and Ba 2014)." "Table 3: Accuracy via single 224 crop on ImageNet validation set of different backbones using SGD."
Researcher Affiliation | Collaboration | Xiang Li (1,2), Shuo Chen (1), Jian Yang (1); (1) Deep Insight@PCALab, Nanjing University of Science and Technology; (2) Momenta
Pseudocode | No | No structured pseudocode or algorithm blocks found.
Open Source Code | No | No concrete access to source code (repository link, explicit release statement, or code in supplementary materials) for the methodology described in this paper was provided.
Open Datasets | Yes | "We conduct comprehensive experiments on the ImageNet (Deng et al. 2009) dataset accordingly." "On the large-scale tasks (ImageNet (Deng et al. 2009) classification/COCO (Lin et al. 2014) detection)"
Dataset Splits | Yes | "We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with single 224×224 central crop. All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs." (See the configuration sketch after this table.)
Hardware Specification | No | "The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training." No specific GPU model or other hardware details were provided.
Software Dependencies | No | "For a fair comparison, all experiments are run under a unified pytorch (Paszke et al. 2017) framework." No specific version numbers for software dependencies were provided.
Experiment Setup | Yes | "The training settings are kept similar with (Li, Hu, and Yang 2019), except that we set the weight decay ratio λ to 0 for all the bias part in networks (He et al. 2019)... All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings with learning rate 0.001, β1 = 0.9, β2 = 0.999. The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training."
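
Below is a minimal PyTorch sketch of the training and evaluation configuration quoted in the Dataset Splits and Experiment Setup rows. The hyperparameters (SGD with learning rate 0.1, momentum 0.9, weight decay 1e-4, decayed by 10× every 30 epochs over 100 epochs; zero weight decay on bias terms; Adam with learning rate 0.001, β1 = 0.9, β2 = 0.999; total batch size 256; single 224×224 central-crop evaluation) come from the excerpts above. The ResNet-50 backbone, the resize-to-256 step before the central crop, and the ImageNet normalization statistics are standard-practice assumptions, not stated in the quoted text.

```python
import torch
import torchvision
from torchvision import transforms

# Backbone choice is an assumption for illustration; the paper reports several backbones.
model = torchvision.models.resnet50()

# Weight decay 1e-4 on weights, but 0 for every bias term, as quoted above.
decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

param_groups = [
    {"params": decay, "weight_decay": 1e-4},
    {"params": no_decay, "weight_decay": 0.0},
]

# SGD: lr 0.1, momentum 0.9, trained for 100 epochs, lr divided by 10 every 30 epochs.
optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Alternative optimizer quoted in the paper: Adam with its default settings.
# optimizer = torch.optim.Adam(param_groups, lr=0.001, betas=(0.9, 0.999))

# Validation preprocessing: single 224x224 central crop. The resize to 256 and the
# normalization statistics are conventional ImageNet practice, assumed rather than stated.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Total batch size 256, reported as 8 GPUs at 32 images per GPU.
total_batch_size = 256
```

The parameter-group split is the idiomatic way to express per-parameter weight decay in PyTorch; here only parameters whose names end in ".bias" are exempted, mirroring the paper's statement that λ is set to 0 for all bias terms.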