Why Do We Need Weight Decay in Modern Deep Learning?
Authors: Francesco D'Angelo, Maksym Andriushchenko, Aditya Vardhan Varre, Nicolas Flammarion
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work delves into the mechanisms underlying the benefits of weight decay by training established machine learning models in both regimes: ResNet on popular vision tasks (over-training) and Transformer on text data (under-training). |
| Researcher Affiliation | Academia | Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion. Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland. {francesco.dangelo,maksym.andriushchenko,aditya.varre,nicolas.flammarion}@epfl.ch |
| Pseudocode | No | The paper describes updates using mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 4) but does not provide structured pseudocode or algorithm blocks. (A generic sketch of such update rules is given below the table.) |
| Open Source Code | Yes | The code is available at https://github.com/tml-epfl/why-weight-decay |
| Open Datasets | Yes | We train a ResNet18 on subsets of the CIFAR-5m dataset (Nakkiran et al., 2020)... |
| Dataset Splits | No | The paper mentions using datasets like CIFAR-10, Tiny-ImageNet, and OpenWebText, and shows a 'Validation loss' plot, implying the use of a validation set. However, it does not provide specific details on the train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | Yes | Each run requires approximately 2 GPU hours on an Nvidia A100 GPU. |
| Software Dependencies | No | The paper mentions using the nanoGPT repository and discusses numerical precisions like 'bfloat16' and 'float32', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train a 124M parameter model (known as GPT-2-Small) for 50 000 iterations with a batch size of 256. ... we train with AdamW using the default LR 0.0006, a short 400-iteration LR warmup, gradient clipping with the ℓ2-threshold 1.0, and 10× cosine LR decay. (See the configuration sketch below the table.) |
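
Since the paper expresses its weight-decay updates only as equations (Eq. 1, Eq. 2, Eq. 4 in the text, not reproduced here), the following minimal sketch illustrates the two standard update rules such equations typically distinguish: coupled L2 regularization added to the gradient, and decoupled (AdamW-style) weight decay applied directly to the weights. The function names and scalar formulation are illustrative and are not taken from the paper or its code.

```python
import torch

def sgd_with_l2(param: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """Coupled weight decay: the L2 penalty wd * param is added to the
    gradient before a plain SGD step."""
    return param - lr * (grad + wd * param)

def decoupled_weight_decay_step(param: torch.Tensor, update: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """Decoupled (AdamW-style) weight decay: the weights are shrunk by
    lr * wd * param separately from the (possibly adaptive) gradient update."""
    return param - lr * update - lr * wd * param
```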
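
The GPT-2-Small setup quoted in the "Experiment Setup" row maps onto a standard PyTorch training loop. The sketch below mirrors the stated hyperparameters (AdamW, peak LR 0.0006, 400-iteration linear warmup, cosine LR decay, gradient clipping at an ℓ2-threshold of 1.0, batch size 256, 50 000 iterations). The weight-decay value, the cosine floor of LR/10, and the stand-in linear model with synthetic data are assumptions for illustration only; they are not specified in the quoted text.

```python
import math
import torch

# Hyperparameters from the quoted setup; WEIGHT_DECAY and MIN_LR_FACTOR are
# illustrative assumptions, not values taken from the quoted text.
MAX_ITERS, WARMUP_ITERS, BATCH_SIZE = 50_000, 400, 256
PEAK_LR, MIN_LR_FACTOR, WEIGHT_DECAY = 6e-4, 0.1, 0.1

model = torch.nn.Linear(128, 10)  # stand-in for the 124M-parameter GPT-2-Small
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 400 iterations, then cosine decay of the LR.
    if step < WARMUP_ITERS:
        return (step + 1) / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / max(1, MAX_ITERS - WARMUP_ITERS)
    return MIN_LR_FACTOR + (1.0 - MIN_LR_FACTOR) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(MAX_ITERS):
    x = torch.randn(BATCH_SIZE, 128)           # synthetic batch in place of OpenWebText
    y = torch.randint(0, 10, (BATCH_SIZE,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # ℓ2-threshold 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```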