Understanding Decoupled and Early Weight Decay

Authors: Johan Bjorck, Kilian Q. Weinberger, Carla Gomes (pp. 6777–6785)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that by applying WD only at the start, the network norm stays small throughout training. This has a regularizing effect, as the effective gradient updates become larger. However, traditional generalization metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric can. We also show how the growth of network weights is heavily influenced by the dataset and its generalization properties. For decoupled WD, we perform experiments in NLP and RL, where adaptive optimizers are the norm.
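The quoted finding — weight decay applied only during an initial phase keeps the network norm small, which enlarges the effective gradient updates — can be illustrated with a toy SGD loop. This is a minimal sketch, not the paper's implementation; the function name, the scalar weight, and the zero-gradient example are all illustrative assumptions:

```python
def sgd_early_wd(w, grad_fn, lr=0.1, wd=0.1, wd_steps=50, total_steps=200):
    """SGD on a scalar weight; L2-style decay is applied only early on."""
    for step in range(total_steps):
        g = grad_fn(w)
        if step < wd_steps:      # weight decay only at the start of training
            g = g + wd * w
        w = w - lr * g
    return w

# With a flat loss (zero gradient), only the early decay phase moves w:
# it shrinks by a factor (1 - lr * wd) per step for wd_steps steps,
# then stays fixed for the remaining updates.
w_final = sgd_early_wd(1.0, lambda w: 0.0)
```

Because the decay stops after `wd_steps`, any later regularizing effect must come from the smaller weight norm itself rather than from an ongoing decay term, which is the mechanism the quoted passage describes.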
Researcher Affiliation | Academia | Johan Bjorck, Kilian Q. Weinberger, Carla P. Gomes, Cornell University, {njb225,kqw4,gomes}@cornell.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper uses publicly available codebases like fairseq and dopamine, but does not state that the authors are releasing their own code for the methodology described.
Open Datasets | Yes | We replicate their experimental setup with identical hyperparameters (listed in the Appendix), training Resnet18 on Cifar10 and Resnet50 on Cifar100. We additionally provide experiments on tiny-imagenet [Karpathy, Li, and Johnson 2017 (accessed 2020-01-01)] using densenet 121 [Huang et al. 2017]. We first consider translation on the IWSLT 14 German-to-English dataset [Cettolo et al. 2014]... Secondly, we also consider the RL agent DQN [Mnih et al. 2015], using the publicly available dopamine codebase [Castro et al. 2018] with their default hyperparameters (see the Appendix), trained on a handful of Atari games...
Dataset Splits | No | The paper mentions using standard datasets like Cifar10/100, tiny-imagenet, IWSLT 14, and Atari games, but does not explicitly state train/validation/test splits or cross-validation methodologies in the main text.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper refers to using existing codebases like fairseq and dopamine, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For investigating observations in Golatkar, Achille, and Soatto [2019] we replicate their experimental setup with identical hyperparameters (listed in the Appendix), training Resnet18 on Cifar10 and Resnet50 on Cifar100. Also: We consider λ ∈ {1e-3, 1e-4, 1e-5}, where the middle parameter is the default parameter used in fairseq; see the Appendix for all hyperparameters.
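The λ values in the setup above are decoupled weight-decay coefficients used alongside an adaptive optimizer. As a point of reference for what "decoupled" means here, a minimal sketch of a single decoupled update (AdamW-style, with the adaptive gradient rescaling omitted; function and argument names are illustrative, not from the paper or fairseq):

```python
def decoupled_wd_step(w, g, lr=1e-3, wd=1e-4):
    """One decoupled weight-decay update on a scalar weight.

    The decay term shrinks the weight directly, so it does not pass
    through the (omitted) adaptive rescaling applied to the gradient g.
    """
    w = w - lr * wd * w   # decoupled decay term, proportional to w itself
    w = w - lr * g        # plain gradient step (adaptive part omitted)
    return w
```

In a coupled formulation the decay would instead be added to `g` before any adaptive rescaling; decoupling is what makes λ behave consistently across parameters with different gradient magnitudes.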