Decoupled Weight Decay Regularization

Authors: Ilya Loshchilov, Frank Hutter

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now evaluate the performance of decoupled weight decay under various training budgets and learning rate schedules. Our experimental setup follows that of Gastaldi (2017)... Figure 1: Adam performs better with decoupled weight decay (bottom row, AdamW) than with L2 regularization (top row, Adam). We show the final test error...
Researcher Affiliation | Academia | Ilya Loshchilov & Frank Hutter, University of Freiburg, Freiburg, Germany, ilya.loshchilov@gmail.com, fh@cs.uni-freiburg.de
Pseudocode | Yes | Algorithm 1 SGD with L2 regularization and SGD with decoupled weight decay (SGDW), both with momentum... Algorithm 2 Adam with L2 regularization and Adam with decoupled weight decay (AdamW) (a minimal update-rule sketch follows the table)
Open Source Code | Yes | Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW (a PyTorch usage example follows the table)
Open Datasets | Yes | We also perform experiments on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32x32-pixel images.
Dataset Splits | No | The paper mentions using the CIFAR-10 and ImageNet32x32 datasets but does not explicitly provide details about the training, validation, and test splits (e.g., percentages, sample counts, or an explicit reference to a specific split methodology).
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions using 'fb.resnet.torch' and discusses implementations in TensorFlow and PyTorch, but it does not specify exact version numbers for these software components or for any other libraries used.
Experiment Setup | Yes | We always used a batch size of 128. For each learning rate schedule and weight decay variant, we trained a 2x64d ResNet for 100 epochs, using different settings of the initial learning rate α and the weight decay factor λ. We fixed the initial learning rate to 0.001 which represents both the default learning rate for Adam and the one which showed reasonably good results in our experiments. (The quoted settings are collected in a configuration sketch after the table.)
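
The Pseudocode row above points to Algorithms 1 and 2 of the paper (SGDW and AdamW). As a rough illustration of the decoupling they describe, here is a minimal NumPy sketch of a single SGD step with L2 regularization versus a single SGDW step; the function names and hyperparameter values are placeholders rather than the paper's code, and the schedule multiplier of Algorithm 1 is fixed to 1 for brevity.

```python
import numpy as np

def sgd_l2_step(w, grad, m, lr=0.05, momentum=0.9, l2=5e-4):
    # L2 regularization: the decay term is folded into the gradient, so it is
    # also scaled by the learning rate and mixed into the momentum buffer.
    g = grad + l2 * w
    m = momentum * m + lr * g
    return w - m, m

def sgdw_step(w, grad, m, lr=0.05, momentum=0.9, wd=5e-4):
    # Decoupled weight decay (SGDW): the gradient/momentum step is taken without
    # the decay term, and the weights are then shrunk directly by wd * w.
    m = momentum * m + lr * grad
    return w - m - wd * w, m

# Tiny usage example with placeholder values.
w, m = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.1, 0.3])
w, m = sgdw_step(w, grad, m)
```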
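
The Open Source Code row notes community implementations in TensorFlow and PyTorch. Below is a short usage sketch with PyTorch's built-in torch.optim.AdamW; the toy model, batch, and hyperparameter values are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# torch.optim.AdamW applies decoupled weight decay directly to the parameters,
# whereas torch.optim.Adam with weight_decay adds an L2 term to the gradients.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```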
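
The Experiment Setup row quotes a batch size of 128, a 2x64d ResNet trained for 100 epochs (following Gastaldi, 2017), and an initial learning rate of 0.001 for Adam. The configuration sketch below simply collects those quoted values; the dictionary keys are placeholders, and the weight decay factor and schedule are left unset because the paper sweeps them rather than fixing them.

```python
# Settings quoted from the paper's setup; keys are placeholders for illustration.
experiment_config = {
    "batch_size": 128,
    "epochs": 100,
    "model": "2x64d ResNet (setup of Gastaldi, 2017)",
    "adam_initial_lr": 1e-3,   # fixed in the paper's Adam/AdamW runs
    "weight_decay": None,      # swept over a grid in the paper, not fixed here
    "lr_schedule": None,       # the paper compares several schedules
}
```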