Decoupled Weight Decay Regularization

Authors: Ilya Loshchilov, Frank Hutter

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now evaluate the performance of decoupled weight decay under various training budgets and learning rate schedules. Our experimental setup follows that of Gastaldi (2017)... Figure 1: Adam performs better with decoupled weight decay (bottom row, AdamW) than with L2 regularization (top row, Adam). We show the final test error...
Researcher Affiliation | Academia | Ilya Loshchilov & Frank Hutter, University of Freiburg, Freiburg, Germany, ilya.loshchilov@gmail.com, fh@cs.uni-freiburg.de
Pseudocode | Yes | Algorithm 1 SGD with L2 regularization and SGD with decoupled weight decay (SGDW), both with momentum... Algorithm 2 Adam with L2 regularization and Adam with decoupled weight decay (AdamW) (a minimal update-rule sketch follows the table)
Open Source Code | Yes | Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW (a PyTorch usage example follows the table)
Open Datasets | Yes | We also perform experiments on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32x32-pixel images.
Dataset Splits | No | The paper mentions using the CIFAR-10 and ImageNet32x32 datasets but does not explicitly provide details about the training, validation, and test splits (e.g., percentages, sample counts, or an explicit reference to a specific split methodology).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions using 'fb.resnet.torch' and discusses implementations in TensorFlow and PyTorch, but it does not specify exact version numbers for these software components or for any other libraries used.
Experiment Setup | Yes | We always used a batch size of 128. For each learning rate schedule and weight decay variant, we trained a 2x64d ResNet for 100 epochs, using different settings of the initial learning rate α and the weight decay factor λ. We fixed the initial learning rate to 0.001 which represents both the default learning rate for Adam and the one which showed reasonably good results in our experiments. (The quoted settings are collected in a configuration sketch after the table.)
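
The Pseudocode row above points to Algorithms 1 and 2 of the paper (SGDW and AdamW). As a rough illustration of the decoupling they describe, here is a minimal NumPy sketch of a single SGD step with L2 regularization versus a single SGDW step; the function names and hyperparameter values are placeholders rather than the paper's code, and the schedule multiplier of Algorithm 1 is fixed to 1 for brevity.

```python
import numpy as np

def sgd_l2_step(w, grad, m, lr=0.05, momentum=0.9, l2=5e-4):
    # L2 regularization: the decay term is folded into the gradient, so it is
    # also scaled by the learning rate and mixed into the momentum buffer.
    g = grad + l2 * w
    m = momentum * m + lr * g
    return w - m, m

def sgdw_step(w, grad, m, lr=0.05, momentum=0.9, wd=5e-4):
    # Decoupled weight decay (SGDW): the gradient/momentum step is taken without
    # the decay term, and the weights are then shrunk directly by wd * w.
    m = momentum * m + lr * grad
    return w - m - wd * w, m

# Tiny usage example with placeholder values.
w, m = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.1, 0.3])
w, m = sgdw_step(w, grad, m)
```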
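
The Open Source Code row notes community implementations in TensorFlow and PyTorch. Below is a short usage sketch with PyTorch's built-in torch.optim.AdamW; the toy model, batch, and hyperparameter values are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# torch.optim.AdamW applies decoupled weight decay directly to the parameters,
# whereas torch.optim.Adam with weight_decay adds an L2 term to the gradients.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```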
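
The Experiment Setup row quotes a batch size of 128, a 2x64d ResNet trained for 100 epochs (following Gastaldi, 2017), and an initial learning rate of 0.001 for Adam. The configuration sketch below simply collects those quoted values; the dictionary keys are placeholders, and the weight decay factor and schedule are left unset because the paper sweeps them rather than fixing them.

```python
# Settings quoted from the paper's setup; keys are placeholders for illustration.
experiment_config = {
    "batch_size": 128,
    "epochs": 100,
    "model": "2x64d ResNet (setup of Gastaldi, 2017)",
    "adam_initial_lr": 1e-3,   # fixed in the paper's Adam/AdamW runs
    "weight_decay": None,      # swept over a grid in the paper, not fixed here
    "lr_schedule": None,       # the paper compares several schedules
}
```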