Understanding the Disharmony between Weight Normalization Family and Weight Decay
Authors: Xiang Li, Shuo Chen, Jian Yang
AAAI 2020, pp. 4715-4722 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our method is demonstrated by experiments on the large-scale ImageNet dataset. ... Table 1 shows the impact of λ on network training with two widely used optimizers, SGD (Sutskever et al. 2013) with momentum and Adam (Kingma and Ba 2014). ... Table 3: Accuracy via single 224 crop on the ImageNet validation set of different backbones using SGD. |
| Researcher Affiliation | Collaboration | Xiang Li (1,2), Shuo Chen (1), Jian Yang (1); 1: Deep Insight@PCALab, Nanjing University of Science and Technology; 2: Momenta |
| Pseudocode | No | No structured pseudocode or algorithm blocks found. |
| Open Source Code | No | No concrete access to source code (repository link, explicit release statement, or code in supplementary materials) for the methodology described in this paper was provided. |
| Open Datasets | Yes | We conduct comprehensive experiments on the ImageNet (Deng et al. 2009) dataset accordingly. ... On the large-scale tasks (ImageNet (Deng et al. 2009) classification / COCO (Lin et al. 2014) detection) |
| Dataset Splits | Yes | We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with single 224×224 central crop. All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. |
| Hardware Specification | No | The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. No specific GPU model or other hardware details were provided. |
| Software Dependencies | No | For a fair comparison, all experiments are run under a unified pytorch (Paszke et al. 2017) framework. No specific version numbers for software dependencies were provided. |
| Experiment Setup | Yes | The training settings are kept similar with (Li, Hu, and Yang 2019), except that we set the weight decay ratio λ to 0 for all the bias part in networks (He et al. 2019)... All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings with learning rate 0.001, β1 = 0.9, β2 = 0.999. The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. (A configuration sketch based on these settings is shown below the table.) |
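Based on the hyperparameters quoted above, the reported SGD setup could be reproduced in PyTorch roughly as follows. This is a minimal sketch under stated assumptions: the `build_param_groups` helper and the ResNet-50 backbone are illustrative choices, not code released by the authors.

```python
import torch
import torchvision

# Sketch of the reported ImageNet training configuration:
# weight decay 1e-4 on weights only (lambda = 0 for biases), momentum 0.9,
# learning rate 0.1 decayed by 10x every 30 epochs, 100 epochs in total.

def build_param_groups(model, weight_decay=1e-4):
    """Split parameters so biases (and other 1-D params) receive zero weight decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

model = torchvision.models.resnet50()  # backbone choice is an assumption; Table 3 covers several
optimizer = torch.optim.SGD(build_param_groups(model), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Adam alternative reported in the paper (default betas, learning rate 0.001):
# optimizer = torch.optim.Adam(build_param_groups(model), lr=0.001, betas=(0.9, 0.999))

for epoch in range(100):
    # ... one pass over the ImageNet training set (total batch size 256, 32 images per GPU) ...
    scheduler.step()
```

The only non-default detail here is the parameter split, which reflects the quoted statement that the weight decay ratio is set to 0 for all bias terms; everything else follows the standard 100-epoch step-decay recipe described in the setup row.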