Three Mechanisms of Weight Decay Regularization
Authors: Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks. |
| Researcher Affiliation | Academia | Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse University of Toronto, Vector Institute {gdzhang, cqwang, bowenxu, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: K-FAC with L2 regularization and K-FAC with weight decay. Subscript l denotes layers, w_l = vec(W_l). We assume zero momentum for simplicity. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | Yes | Throughout the paper, we perform experiments on image classification with three different datasets, MNIST (LeCun et al., 1998), CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | Yes | For each algorithm, best hyperparameters (learning rate and regularization factor) are selected using grid search on held-out 5k validation set. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running experiments are provided. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers) are provided. |
| Experiment Setup | Yes | By default, a batch size of 128 is used unless stated otherwise. For SGD and Adam, we train the networks with a budget of 200 epochs and decay the learning rate by a factor of 10 every 60 epochs for batch sizes of 128 and 640, and every 80 epochs for the batch size of 2K. For K-FAC, we train the networks for only 100 epochs and decay the learning rate every 40 epochs. Additionally, the curvature matrix is updated by a running average with re-estimation every 10 iterations, and the inverse operation is amortized over 100 iterations. For K-FAC, we use a fixed damping term of 1e-3 unless stated otherwise. |
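
To make mechanism (1) from the Research Type row concrete: for a layer whose loss is invariant to the scale of its weights (e.g., a layer followed by batch normalization), the gradient shrinks as the weight norm grows, so what behaves like the learning rate is the step size taken on the normalized weights. A standard way to state this, consistent with the paper's analysis (the notation here is ours, not lifted from the paper):

$$
L(\alpha w) = L(w)\ \ \forall \alpha > 0
\;\Longrightarrow\;
\nabla_w L(w) = \frac{1}{\lVert w \rVert}\,\nabla L\!\left(\frac{w}{\lVert w \rVert}\right),
\qquad
\eta_{\mathrm{eff}} \propto \frac{\eta}{\lVert w \rVert_2^2}.
$$

Weight decay keeps $\lVert w \rVert$ small, which raises this effective learning rate; for such scale-invariant layers the regularization effect comes from the larger effective step size rather than from directly constraining the functions the network can represent.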
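
The Pseudocode row refers to the paper's Algorithm 1, which contrasts K-FAC with L2 regularization against K-FAC with (decoupled) weight decay. Below is a minimal sketch of that contrast for a generic preconditioned update, not the authors' code: `precond` stands in for the Kronecker-factored inverse curvature (or Adam's diagonal rescaling), and the learning rate and regularization factor are placeholders.

```python
import torch

def step_l2(w, grad, precond, lr, lam):
    # L2 regularization: the penalty gradient lam * w is added to the loss
    # gradient *before* preconditioning, so it is rescaled by the
    # inverse-curvature transform along with everything else.
    return w - lr * precond(grad + lam * w)

def step_weight_decay(w, grad, precond, lr, lam):
    # Decoupled weight decay: the shrinkage term bypasses the preconditioner
    # entirely (the AdamW-style rule, and the weight-decay variant of Algorithm 1).
    return w - lr * precond(grad) - lr * lam * w

# Toy check: with an identity preconditioner (plain SGD without momentum) the
# two rules coincide; with any non-trivial preconditioner they diverge.
w, g = torch.randn(4), torch.randn(4)
identity = lambda v: v
scaled = lambda v: 0.1 * v  # stand-in for an inverse-curvature rescaling
print(torch.allclose(step_l2(w, g, identity, 0.1, 5e-4),
                     step_weight_decay(w, g, identity, 0.1, 5e-4)))  # True
print(torch.allclose(step_l2(w, g, scaled, 0.1, 5e-4),
                     step_weight_decay(w, g, scaled, 0.1, 5e-4)))    # False
```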
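
The Experiment Setup row gives the training schedule in prose; the snippet below is a hedged sketch of how that SGD configuration could be written in PyTorch (the paper does not name its framework). Only the 45k/5k CIFAR-10 split, the batch size of 128, the 200-epoch budget, and the decay-by-10-every-60-epochs schedule come from the rows above; the transform, base learning rate, momentum, weight-decay factor, and ResNet-18 architecture are placeholder assumptions.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR-10 with the 5k held-out validation split mentioned in the report.
full_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
train_set, val_set = torch.utils.data.random_split(full_train, [45_000, 5_000])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Placeholder model and SGD hyperparameters (lr, momentum, and weight decay are
# assumptions; the paper tunes the learning rate and regularization factor by
# grid search on the validation set).
model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# 200-epoch budget with the learning rate decayed by 10x every 60 epochs
# (the schedule quoted for batch sizes 128 and 640).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 180], gamma=0.1)

for epoch in range(200):
    # ... one training pass over train_loader goes here, then:
    scheduler.step()
```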