Why Do We Need Weight Decay in Modern Deep Learning?

Authors: Francesco D'Angelo, Maksym Andriushchenko, Aditya Vardhan Varre, Nicolas Flammarion

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work delves into the mechanisms underlying the benefits of weight decay by training established machine learning models in both regimes: ResNet on popular vision tasks (over-training) and Transformer on text data (under-training).
Researcher Affiliation | Academia | Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion, Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland. {francesco.dangelo,maksym.andriushchenko,aditya.varre,nicolas.flammarion}@epfl.ch
Pseudocode | No | The paper describes updates using mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 4) but does not provide structured pseudocode or algorithm blocks. (A generic form of such an update is sketched below the table.)
Open Source Code | Yes | The code is available at https://github.com/tml-epfl/why-weight-decay
Open Datasets | Yes | We train a ResNet18 on subsets of the CIFAR-5m dataset (Nakkiran et al., 2020)...
Dataset Splits | No | The paper mentions using datasets like CIFAR-10, Tiny-ImageNet, and OpenWebText, and shows a 'Validation loss' plot, implying the use of a validation set. However, it does not provide specific details on the train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification | Yes | Each run requires approximately 2 GPU hours on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using the 'nanoGPT' repository and discusses numerical precisions like 'bfloat16' and 'float32', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We train a 124M parameter model (known as GPT-2-Small) for 50 000 iterations with a batch size of 256. ... we train with AdamW using the default LR 0.0006, a short 400-iteration LR warmup, gradient clipping with the ℓ2-threshold 1.0, and 10× cosine LR decay. (See the configuration sketch below.)
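The "Pseudocode" row above notes that the paper expresses its updates only through equations. For orientation, here is a minimal sketch of the standard weight-decay updates such equations typically take; this is an assumption, since Eq. 1, Eq. 2, and Eq. 4 themselves are not reproduced in this report.

```latex
% Standard SGD update with coupled weight decay (L2 regularization):
w_{t+1} = w_t - \eta_t \bigl( \nabla L(w_t) + \lambda w_t \bigr)

% Decoupled weight decay as applied by AdamW, where \hat{m}_t and \hat{v}_t are
% the bias-corrected first and second moment estimates:
w_{t+1} = w_t - \eta_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta_t \lambda w_t
```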
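The "Experiment Setup" row can be condensed into a short PyTorch sketch. This is not the authors' implementation (their nanoGPT-based code is at https://github.com/tml-epfl/why-weight-decay); the weight-decay coefficient, the reading of "10× cosine LR decay" as a floor of base_lr / 10, and the stand-in model, batch, and loss are assumptions for illustration only.

```python
import math
import torch

max_iters    = 50_000          # reported number of iterations
warmup_iters = 400             # reported LR warmup length
base_lr      = 6e-4            # reported default AdamW LR
min_lr       = base_lr / 10    # assumed reading of "10x cosine LR decay"
weight_decay = 0.1             # placeholder; the paper sweeps this coefficient
grad_clip    = 1.0             # reported l2 gradient-clipping threshold

model = torch.nn.Linear(768, 768)  # stand-in for the 124M-parameter GPT-2-Small
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
device_type = "cuda" if torch.cuda.is_available() else "cpu"

def lr_at(it: int) -> float:
    """Linear warmup followed by cosine decay from base_lr down to min_lr."""
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(it)                   # set the scheduled LR
    x = torch.randn(256, 768)                     # placeholder batch (reported batch size 256)
    with torch.autocast(device_type, dtype=torch.bfloat16):  # bfloat16 precision as mentioned
        loss = model(x).pow(2).mean()             # placeholder loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)  # reported clipping threshold
    optimizer.step()
```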