Three Mechanisms of Weight Decay Regularization

Authors: Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks. (Mechanism (1) is sketched briefly after the table.)
Researcher Affiliation | Academia | Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse; University of Toronto, Vector Institute; {gdzhang, cqwang, bowenxu, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: K-FAC with L2 regularization and K-FAC with weight decay. Subscript l denotes layers, w_l = vec(W_l). We assume zero momentum for simplicity. (A minimal sketch of the two update rules follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | Throughout the paper, we perform experiments on image classification with three different datasets: MNIST (LeCun et al., 1998), CIFAR-10, and CIFAR-100 (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | For each algorithm, the best hyperparameters (learning rate and regularization factor) are selected using grid search on a held-out 5k validation set. (A hypothetical split sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver names and versions) are provided.
Experiment Setup | Yes | By default, a batch size of 128 is used unless stated otherwise. For SGD and Adam, we train the networks with a budget of 200 epochs and decay the learning rate by a factor of 10 every 60 epochs for batch sizes of 128 and 640, and every 80 epochs for the batch size of 2K, whereas for K-FAC we train the networks for only 100 epochs and decay the learning rate every 40 epochs. Additionally, the curvature matrix is updated by running average with re-estimation every 10 iterations, and the inverse operation is amortized over 100 iterations. For K-FAC, we use a fixed damping term of 1e-3 unless stated otherwise. (A small helper capturing the learning-rate schedule follows the table.)
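
As a reading aid for mechanism (1) in the Research Type row: the paper argues that for scale-invariant layers (e.g. layers followed by batch normalization), shrinking the weight norm increases the effective learning rate. A minimal sketch of that relation, assuming exact scale invariance of the loss in the layer's weights:

```latex
% Mechanism (1), sketched under the assumption that the loss is scale-invariant
% in the layer's weights w, i.e. L(\alpha w) = L(w) for all \alpha > 0
% (as for a layer followed by batch normalization). Differentiating the
% identity gives a gradient that shrinks as the weight norm grows:
\nabla L(\alpha w) = \tfrac{1}{\alpha}\,\nabla L(w).
% Hence the SGD step w_{t+1} = w_t - \eta \nabla L(w_t) moves the direction
% \hat{w} = w / \lVert w \rVert with an effective step size of roughly
\eta_{\text{eff}} \;\propto\; \frac{\eta}{\lVert w_t \rVert^{2}},
% so weight decay, by keeping \lVert w \rVert small, raises \eta_{\text{eff}}.
```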
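The Pseudocode row refers to the paper's Algorithm 1, which contrasts K-FAC with L2 regularization (penalty gradient added before preconditioning) against K-FAC with weight decay (weights shrunk directly). Below is a minimal NumPy sketch of the per-layer update, assuming the usual Kronecker factorization F ≈ A ⊗ S with A the input-activation second-moment matrix and S the pre-activation-gradient second-moment matrix; function names are ours, and the damping and momentum terms of the full algorithm are omitted.

```python
import numpy as np

def kfac_precondition(grad_W, A_inv, S_inv):
    """Apply the Kronecker-factored inverse Fisher to a layer's gradient.

    For F ~= A (x) S, (A (x) S)^{-1} vec(dW) corresponds to S^{-1} dW A^{-1}
    when W has shape (out_dim, in_dim).
    """
    return S_inv @ grad_W @ A_inv

def kfac_step_l2(W, grad_W, A_inv, S_inv, lr, l2):
    # L2 regularization: the penalty gradient l2 * W is added *before*
    # preconditioning, so it is rescaled along with the data gradient.
    return W - lr * kfac_precondition(grad_W + l2 * W, A_inv, S_inv)

def kfac_step_wd(W, grad_W, A_inv, S_inv, lr, wd):
    # Weight decay: only the data gradient is preconditioned, and the weights
    # are shrunk directly, decoupled from the curvature estimate.
    return W - lr * (kfac_precondition(grad_W, A_inv, S_inv) + wd * W)
```

The two rules coincide only when the preconditioner is a multiple of the identity; with an anisotropic curvature estimate the L2 penalty is distorted by the preconditioner, which is the distinction the algorithm box draws.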
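For the Dataset Splits row, the paper states only that a held-out 5k validation set is used for the grid search; how the split is drawn is not specified. A hypothetical torchvision sketch that carves 5k examples out of the CIFAR-10 training set (the 45k/5k proportions and the fixed seed are assumptions):

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.CIFAR10("data", train=True, download=True,
                              transform=transforms.ToTensor())
# Hold out 5k of the 50k training images for hyperparameter selection.
train_set, val_set = random_split(full_train, [45_000, 5_000],
                                  generator=torch.Generator().manual_seed(0))
```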
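The Experiment Setup row describes a step schedule: decay the learning rate by a factor of 10 every 60 epochs (SGD/Adam at batch sizes 128 and 640), every 80 epochs (batch size 2K), and every 40 epochs (K-FAC). A small helper capturing that schedule; the function name and the exact batch-size threshold are illustrative, not from the paper.

```python
def stepped_lr(base_lr, epoch, optimizer="sgd", batch_size=128):
    """Learning rate after step decay, per the setup quoted above.

    Decay by 10x every 60 epochs for SGD/Adam at batch sizes 128/640,
    every 80 epochs at batch size 2K, and every 40 epochs for K-FAC.
    """
    if optimizer == "kfac":
        interval = 40          # K-FAC trains for 100 epochs in total
    elif batch_size >= 2000:
        interval = 80          # large-batch (2K) runs
    else:
        interval = 60          # batch sizes 128 and 640, 200 epochs in total
    return base_lr * (0.1 ** (epoch // interval))
```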