Understanding the Disharmony between Weight Normalization Family and Weight Decay
Authors: Xiang Li, Shuo Chen, Jian Yang
AAAI 2020, pp. 4715-4722 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our method is demonstrated by experiments on the large-scale ImageNet dataset. ... Table 1 shows the impact of λ on network training with two widely used optimizers, SGD (Sutskever et al. 2013) with momentum and Adam (Kingma and Ba 2014). ... Table 3: Accuracy via single 224 crop on the ImageNet validation set of different backbones using SGD. |
| Researcher Affiliation | Collaboration | Xiang Li (1,2), Shuo Chen (1), Jian Yang (1); 1: Deep Insight@PCALab, Nanjing University of Science and Technology; 2: Momenta |
| Pseudocode | No | No structured pseudocode or algorithm blocks found. |
| Open Source Code | No | No concrete access to source code (repository link, explicit release statement, or code in supplementary materials) for the methodology described in this paper was provided. |
| Open Datasets | Yes | We conduct comprehensive experiments on the ImageNet (Deng et al. 2009) dataset accordingly. ... On the large-scale tasks (ImageNet (Deng et al. 2009) classification / COCO (Lin et al. 2014) detection) |
| Dataset Splits | Yes | We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with single 224×224 central crop. All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. |
| Hardware Specification | No | The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. No specific GPU model or other hardware details were provided. |
| Software Dependencies | No | For a fair comparison, all experiments are run under a unified pytorch (Paszke et al. 2017) framework. No specific version numbers for software dependencies were provided. |
| Experiment Setup | Yes | The training settings are kept similar with (Li, Hu, and Yang 2019), except that we set the weight decay ratio λ to 0 for all the bias part in networks (He et al. 2019)... All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings with learning rate 0.001, β1 = 0.9, β2 = 0.999. The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. (A configuration sketch based on these settings is shown below the table.) |
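Based on the hyperparameters quoted above, the reported SGD setup could be reproduced in PyTorch roughly as follows. This is a minimal sketch under stated assumptions: the `build_param_groups` helper and the ResNet-50 backbone are illustrative choices, not code released by the authors.

```python
import torch
import torchvision

# Sketch of the reported ImageNet training configuration:
# weight decay 1e-4 on weights only (lambda = 0 for biases), momentum 0.9,
# learning rate 0.1 decayed by 10x every 30 epochs, 100 epochs in total.

def build_param_groups(model, weight_decay=1e-4):
    """Split parameters so biases (and other 1-D params) receive zero weight decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = torchvision.models.resnet50()  # backbone choice is an assumption; Table 3 covers several
optimizer = torch.optim.SGD(build_param_groups(model), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Adam alternative reported in the paper (default betas, learning rate 0.001):
# optimizer = torch.optim.Adam(build_param_groups(model), lr=0.001, betas=(0.9, 0.999))

for epoch in range(100):
    # ... one pass over the ImageNet training set (total batch size 256, 32 images per GPU) ...
    scheduler.step()
```

The only non-default detail here is the parameter split, which reflects the quoted statement that the weight decay ratio is set to 0 for all bias terms; everything else follows the standard 100-epoch step-decay recipe described in the setup row.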