Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

ICML 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. We experimentally validate that the results hold for NN training in practice. |
| Researcher Affiliation | Academia | EPFL, Switzerland. |
| Pseudocode | Yes | Algorithm 1: Rotational Wrapper for constrained dynamics. (A hedged sketch of such a wrapper follows the table.) |
| Open Source Code | No | The paper uses and cites several open-source libraries (e.g., TIMM, fairseq, nanoGPT, LLM-Baselines) but does not explicitly state that code for its own methodology is released, nor does it provide a link to it. |
| Open Datasets | Yes | We perform our experiments on several popular datasets, i.e., CIFAR-10/100 (Krizhevsky, 2009) and ImageNet-1k (Russakovsky et al., 2015) for image classification, IWSLT2014 (Cettolo et al., 2014) for German-English translation, and WikiText (Merity et al., 2017) and OpenWebText (Radford et al., 2019) for language modelling. |
| Dataset Splits | Yes | For the sweep we train a ResNet-18 on a 90/10 train/val split from the original train set. We train on a random subset containing 90% of the train set and use the remaining 10% for validation, which we report. (See the split sketch below.) |
| Hardware Specification | Yes | Most of the experiments are run on a single NVIDIA A100-SXM4-40GB GPU. |
| Software Dependencies | No | The code utilizes the TIMM library (Wightman, 2019) for vision tasks, fairseq (Ott et al., 2019) for translation, and nanoGPT (Karpathy, 2023) and LLM-Baselines (Pagliardini, 2023) for language modelling. While these libraries are mentioned, specific version numbers for them or other core software dependencies are not provided. |
| Experiment Setup | Yes | Table 4 of the paper (experimental setup, including training and test set definitions) details the learning rate, warmup, epochs, schedule, precision, and the specific weight decay and beta values for each optimizer and model. (An illustrative skeleton follows below.) |
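
For readers who want a concrete picture of the Pseudocode row, the following is a minimal sketch of how a rotational wrapper of this kind could look in PyTorch. It is our own reconstruction, not the authors' code (which, per the Open Source Code row, we did not find): the class name `RotationalWrapper`, the whole-tensor treatment (the paper operates per neuron), and the equilibrium angular rate sqrt(2 * lr * wd), which matches the paper's SGD analysis, are all assumptions.

```python
# Hypothetical sketch of a rotational wrapper in the spirit of Algorithm 1.
# NOT the authors' implementation: names, the whole-tensor treatment, and
# the SGD-style equilibrium rate are our assumptions.
import torch


class RotationalWrapper:
    """Keeps wrapped parameters at a fixed norm and a fixed average rotation
    per step, delegating update directions to an inner optimizer."""

    def __init__(self, params, inner_optimizer, lr, weight_decay):
        self.params = [p for p in params if p.requires_grad]
        self.inner = inner_optimizer  # should be run WITHOUT weight decay
        # Equilibrium angular update per step for SGD (paper's analysis).
        self.eta_r = (2.0 * lr * weight_decay) ** 0.5
        # Fix each parameter's norm at its initial value.
        self.norms = [p.detach().norm() for p in self.params]

    @torch.no_grad()
    def step(self):
        before = [p.detach().clone() for p in self.params]
        self.inner.step()  # inner optimizer proposes an update in place
        for p, old, n0 in zip(self.params, before, self.norms):
            delta = p - old
            # Remove the radial (norm-changing) component of the update.
            radial = (delta * old).sum() / old.norm().pow(2) * old
            tangential = delta - radial
            t_norm = tangential.norm()
            if t_norm > 0:
                # Scale the tangential step to the equilibrium rotation
                # rate, then project back onto the fixed-norm sphere.
                new = old + self.eta_r * old.norm() * tangential / t_norm
                p.copy_(new * (n0 / new.norm()))
            else:
                p.copy_(old)
```

A usage sketch under the same assumptions: the norm constraint replaces explicit weight decay, so the inner optimizer is configured without it.

```python
inner = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
opt = RotationalWrapper(model.parameters(), inner, lr=0.1, weight_decay=5e-4)
```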
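
The 90/10 split in the Dataset Splits row is straightforward to reproduce. Below is a minimal sketch using torchvision's CIFAR-10 loader; the fixed seed is our assumption, since the paper only states that the subset is random.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Recreate a 90/10 train/val split of the CIFAR-10 train set.
# The seed (0) is illustrative; the paper only says the subset is random.
full_train = datasets.CIFAR10(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
n_train = int(0.9 * len(full_train))   # 45,000 examples
n_val = len(full_train) - n_train      # 5,000 examples
train_set, val_set = random_split(full_train, [n_train, n_val],
                                  generator=torch.Generator().manual_seed(0))
```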
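
Finally, as a reading aid for the Experiment Setup row, the skeleton below lists the fields that the paper's Table 4 reports. Every value shown is an illustrative placeholder, not a setting taken from the paper.

```python
# Skeleton of the hyperparameter fields reported in the paper's Table 4.
# All values are placeholders, NOT the paper's actual settings.
experiment_setup = {
    "optimizer": "SGDM",      # placeholder optimizer name
    "learning_rate": 1e-1,    # placeholder
    "warmup_steps": 500,      # placeholder
    "epochs": 100,            # placeholder
    "schedule": "cosine",     # placeholder
    "precision": "bf16",      # placeholder
    "weight_decay": 5e-4,     # placeholder
    "betas": (0.9, 0.999),    # placeholder (Adam-style optimizers)
}
```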