Three Mechanisms of Weight Decay Regularization

Authors: Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks. (Mechanism (1) is sketched briefly after the table.)
Researcher Affiliation | Academia | Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse; University of Toronto, Vector Institute; {gdzhang, cqwang, bowenxu, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: K-FAC with L2 regularization and K-FAC with weight decay. Subscript l denotes layers, w_l = vec(W_l). We assume zero momentum for simplicity. (A minimal sketch of the two update rules follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | Throughout the paper, we perform experiments on image classification with three different datasets: MNIST (LeCun et al., 1998), CIFAR-10, and CIFAR-100 (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | For each algorithm, the best hyperparameters (learning rate and regularization factor) are selected using grid search on a held-out 5k validation set. (A hypothetical split sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments are provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or solver names and versions) are provided.
Experiment Setup | Yes | By default, a batch size of 128 is used unless stated otherwise. For SGD and Adam, we train the networks with a budget of 200 epochs and decay the learning rate by a factor of 10 every 60 epochs for batch sizes of 128 and 640, and every 80 epochs for the batch size of 2K, whereas for K-FAC we train the networks for only 100 epochs and decay the learning rate every 40 epochs. Additionally, the curvature matrix is updated by running average with re-estimation every 10 iterations, and the inverse operation is amortized over 100 iterations. For K-FAC, we use a fixed damping term of 1e-3 unless stated otherwise. (A small helper capturing the learning-rate schedule follows the table.)
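
As a reading aid for mechanism (1) in the Research Type row: the paper argues that for scale-invariant layers (e.g. layers followed by batch normalization), shrinking the weight norm increases the effective learning rate. A minimal sketch of that relation, assuming exact scale invariance of the loss in the layer's weights:

```latex
% Mechanism (1), sketched under the assumption that the loss is scale-invariant
% in the layer's weights w, i.e. L(\alpha w) = L(w) for all \alpha > 0
% (as for a layer followed by batch normalization). Differentiating the
% identity gives a gradient that shrinks as the weight norm grows:
\nabla L(\alpha w) = \tfrac{1}{\alpha}\,\nabla L(w).
% Hence the SGD step w_{t+1} = w_t - \eta \nabla L(w_t) moves the direction
% \hat{w} = w / \lVert w \rVert with an effective step size of roughly
\eta_{\text{eff}} \;\propto\; \frac{\eta}{\lVert w_t \rVert^{2}},
% so weight decay, by keeping \lVert w \rVert small, raises \eta_{\text{eff}}.
```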
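The Pseudocode row refers to the paper's Algorithm 1, which contrasts K-FAC with L2 regularization (penalty gradient added before preconditioning) against K-FAC with weight decay (weights shrunk directly). Below is a minimal NumPy sketch of the per-layer update, assuming the usual Kronecker factorization F ≈ A ⊗ S with A the input-activation second-moment matrix and S the pre-activation-gradient second-moment matrix; function names are ours, and the damping and momentum terms of the full algorithm are omitted.

```python
import numpy as np

def kfac_precondition(grad_W, A_inv, S_inv):
    """Apply the Kronecker-factored inverse Fisher to a layer's gradient.

    For F ~= A (x) S, (A (x) S)^{-1} vec(dW) corresponds to S^{-1} dW A^{-1}
    when W has shape (out_dim, in_dim).
    """
    return S_inv @ grad_W @ A_inv

def kfac_step_l2(W, grad_W, A_inv, S_inv, lr, l2):
    # L2 regularization: the penalty gradient l2 * W is added *before*
    # preconditioning, so it is rescaled along with the data gradient.
    return W - lr * kfac_precondition(grad_W + l2 * W, A_inv, S_inv)

def kfac_step_wd(W, grad_W, A_inv, S_inv, lr, wd):
    # Weight decay: only the data gradient is preconditioned, and the weights
    # are shrunk directly, decoupled from the curvature estimate.
    return W - lr * (kfac_precondition(grad_W, A_inv, S_inv) + wd * W)
```

The two rules coincide only when the preconditioner is a multiple of the identity; with an anisotropic curvature estimate the L2 penalty is distorted by the preconditioner, which is the distinction the algorithm box draws.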
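For the Dataset Splits row, the paper states only that a held-out 5k validation set is used for the grid search; how the split is drawn is not specified. A hypothetical torchvision sketch that carves 5k examples out of the CIFAR-10 training set (the 45k/5k proportions and the fixed seed are assumptions):

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.CIFAR10("data", train=True, download=True,
                              transform=transforms.ToTensor())
# Hold out 5k of the 50k training images for hyperparameter selection.
train_set, val_set = random_split(full_train, [45_000, 5_000],
                                  generator=torch.Generator().manual_seed(0))
```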
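The Experiment Setup row describes a step schedule: decay the learning rate by a factor of 10 every 60 epochs (SGD/Adam at batch sizes 128 and 640), every 80 epochs (batch size 2K), and every 40 epochs (K-FAC). A small helper capturing that schedule; the function name and the exact batch-size threshold are illustrative, not from the paper.

```python
def stepped_lr(base_lr, epoch, optimizer="sgd", batch_size=128):
    """Learning rate after step decay, per the setup quoted above.

    Decay by 10x every 60 epochs for SGD/Adam at batch sizes 128/640,
    every 80 epochs at batch size 2K, and every 40 epochs for K-FAC.
    """
    if optimizer == "kfac":
        interval = 40          # K-FAC trains for 100 epochs in total
    elif batch_size >= 2000:
        interval = 80          # large-batch (2K) runs
    else:
        interval = 60          # batch sizes 128 and 640, 200 epochs in total
    return base_lr * (0.1 ** (epoch // interval))
```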