Decoupled Weight Decay Regularization
Authors: Ilya Loshchilov, Frank Hutter
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now evaluate the performance of decoupled weight decay under various training budgets and learning rate schedules. Our experimental setup follows that of Gastaldi (2017)...Figure 1: Adam performs better with decoupled weight decay (bottom row, AdamW) than with L2 regularization (top row, Adam). We show the final test error... |
| Researcher Affiliation | Academia | Ilya Loshchilov & Frank Hutter, University of Freiburg, Freiburg, Germany; ilya.loshchilov@gmail.com, fh@cs.uni-freiburg.de |
| Pseudocode | Yes | Algorithm 1 SGD with L2 regularization and SGD with decoupled weight decay (SGDW), both with momentum...Algorithm 2 Adam with L2 regularization and Adam with decoupled weight decay (AdamW) |
| Open Source Code | Yes | Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW |
| Open Datasets | Yes | We also perform experiments on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32×32 pixel images. |
| Dataset Splits | No | The paper mentions using CIFAR-10 and ImageNet32x32 datasets but does not explicitly provide details about the training, validation, and test splits (e.g., percentages, sample counts, or explicit references to a specific split methodology). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions using 'fb.resnet.torch' and discusses implementations in 'TensorFlow' and 'PyTorch', but it does not specify exact version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | We always used a batch size of 128. For each learning rate schedule and weight decay variant, we trained a 2x64d ResNet for 100 epochs, using different settings of the initial learning rate α and the weight decay factor λ. We fixed the initial learning rate to 0.001, which represents both the default learning rate for Adam and the one which showed reasonably good results in our experiments. |
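
The Pseudocode row above quotes the paper's Algorithm 1 (SGDW) and Algorithm 2 (AdamW). The following is a minimal NumPy sketch of the distinction those algorithms draw, not the authors' reference code; names and defaults are illustrative, and the paper's schedule multiplier η_t is omitted for brevity.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One parameter update contrasting Adam with L2 regularization
    against AdamW (decoupled weight decay). Illustrative sketch only."""
    beta1, beta2 = betas

    if not decoupled:
        # Adam with L2 regularization: the decay term is folded into the
        # gradient and therefore gets rescaled by the adaptive denominator.
        grad = grad + weight_decay * theta

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    if decoupled:
        # AdamW: weight decay is applied to the weights directly, outside
        # the adaptive update (per Algorithm 2 in the paper).
        theta = theta - lr * weight_decay * theta

    return theta, m, v
```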
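
For the Open Source Code and Experiment Setup rows, the sketch below shows how the quoted settings (batch size 128, 100 epochs, initial learning rate 0.001) map onto PyTorch's `torch.optim.AdamW`, one of the community implementations the paper mentions. The tiny linear model and the weight decay value are placeholders: the paper trains a 2x64d ResNet and sweeps the decay factor λ rather than fixing a single value, and cosine annealing is used here only as one plausible schedule.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Settings quoted in the table: batch size 128, 100 epochs, initial LR 0.001.
BATCH_SIZE, EPOCHS, LR = 128, 100, 1e-3

model = nn.Sequential(             # stand-in for the paper's 2x64d ResNet
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 10),    # CIFAR-10-sized inputs and outputs
)
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=1e-2)  # decoupled decay; λ value is a placeholder
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)           # one possible learning rate schedule

for epoch in range(EPOCHS):
    ...  # iterate over mini-batches of size BATCH_SIZE: forward pass, loss.backward(), optimizer.step()
    scheduler.step()
```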