Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Understanding the Disharmony between Weight Normalization Family and Weight Decay
Authors: Xiang Li, Shuo Chen, Jian Yang
Pages: 4715-4722
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of our method is demonstrated by experiments on the large-scale ImageNet dataset. Table 1 shows the impact of λ on network training with two widely used optimizers SGD (Sutskever et al. 2013) with momentum and Adam (Kingma and Ba 2014). Table 3: Accuracy via single 224 crop on ImageNet validation set of different backbones using SGD. |
| Researcher Affiliation | Collaboration | Xiang Li (1, 2), Shuo Chen (1), Jian Yang (1). (1) Deep Insight@PCALab, Nanjing University of Science and Technology; (2) Momenta |
| Pseudocode | No | No structured pseudocode or algorithm blocks found. |
| Open Source Code | No | No concrete access to source code (repository link, explicit release statement, or code in supplementary materials) for the methodology described in this paper was provided. |
| Open Datasets | Yes | We conduct comprehensive experiments on the ImageNet (Deng et al. 2009) dataset accordingly. On the large-scale tasks (ImageNet (Deng et al. 2009) classification/COCO (Lin et al. 2014) detection) |
| Dataset Splits | Yes | We train networks on the training set and report the Top-1 and Top-5 accuracies on the validation set with single 224 × 224 central crop. All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. |
| Hardware Specification | No | The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. No specific GPU model or other hardware details were provided. |
| Software Dependencies | No | For a fair comparison, all experiments are run under a unified pytorch (Paszke et al. 2017) framework. No specific version numbers for software dependencies were provided. |
| Experiment Setup | Yes | The training settings are kept similar with (Li, Hu, and Yang 2019), except that we set the weight decay ratio λ to 0 for all the bias part in networks (He et al. 2019)... All networks are trained from scratch by SGD (Sutskever et al. 2013) or Adam (Kingma and Ba 2014; Loshchilov and Hutter 2019). SGD is with weight decay 0.0001 and momentum 0.9 for 100 epochs, starting from learning rate 0.1 and decreasing it by a factor of 10 every 30 epochs. Adam keeps the default settings with learning rate 0.001, β1 = 0.9, β2 = 0.999. The total batch size is set as 256 and 8 GPUs (32 images per GPU) are utilized for training. |
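
The SGD settings quoted in the Experiment Setup row can be made concrete with a small sketch. The following is not the authors' code (none was released); it is a plain-Python illustration of the quoted numbers. The function names, the 0-indexed epoch convention, and the rule that a bias parameter is recognized by a name ending in `bias` are all assumptions introduced here for illustration.

```python
# Hyperparameters quoted in the Experiment Setup row (SGD configuration).
BASE_LR = 0.1            # starting learning rate
DECAY_FACTOR = 10        # learning rate is divided by this factor ...
DECAY_EVERY = 30         # ... every 30 epochs
TOTAL_EPOCHS = 100
WEIGHT_DECAY = 0.0001    # applied to non-bias parameters
TOTAL_BATCH_SIZE = 256
NUM_GPUS = 8


def sgd_learning_rate(epoch: int) -> float:
    """Step-decayed learning rate for a given epoch.

    Assumption: epochs are 0-indexed, so the first decay takes effect
    at epoch 30.
    """
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS}), got {epoch}")
    return BASE_LR / (DECAY_FACTOR ** (epoch // DECAY_EVERY))


def per_gpu_batch_size() -> int:
    """Images per GPU when the total batch is split evenly across GPUs."""
    return TOTAL_BATCH_SIZE // NUM_GPUS


def weight_decay_for(param_name: str) -> float:
    """Hypothetical helper: weight decay 0 for bias terms, 0.0001 otherwise.

    Assumption: bias parameters are identified by a name ending in "bias".
    """
    return 0.0 if param_name.endswith("bias") else WEIGHT_DECAY
```

Under these assumptions the schedule has four plateaus (epochs 0-29, 30-59, 60-89, 90-99), and `per_gpu_batch_size()` gives the 32 images per GPU stated in the table. In PyTorch, the same schedule would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`, though the page does not say which scheduler the authors used.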