Why Do We Need Weight Decay in Modern Deep Learning?
Authors: Francesco D'Angelo, Maksym Andriushchenko, Aditya Vardhan Varre, Nicolas Flammarion
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work delves into the mechanisms underlying the benefits of weight decay by training established machine learning models in both regimes: ResNet on popular vision tasks (over-training) and Transformer on text data (under-training). |
| Researcher Affiliation | Academia | Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion. Theory of Machine Learning Lab, EPFL, Lausanne, Switzerland. {francesco.dangelo,maksym.andriushchenko,aditya.varre,nicolas.flammarion}@epfl.ch |
| Pseudocode | No | The paper describes updates using mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 4) but does not provide structured pseudocode or algorithm blocks. (A generic sketch of such update rules is given below the table.) |
| Open Source Code | Yes | The code is available at https://github.com/tml-epfl/why-weight-decay |
| Open Datasets | Yes | We train a ResNet18 on subsets of the CIFAR-5m dataset (Nakkiran et al., 2020)... |
| Dataset Splits | No | The paper mentions using datasets like CIFAR-10, Tiny-ImageNet, and OpenWebText, and shows a 'Validation loss' plot, implying the use of a validation set. However, it does not provide specific details on the train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | Yes | Each run requires approximately 2 GPU hours on an Nvidia A100 GPU. |
| Software Dependencies | No | The paper mentions using the nanoGPT repository and discusses numerical precisions like 'bfloat16' and 'float32', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train a 124M parameter model (known as GPT-2-Small) for 50 000 iterations with a batch size of 256. ... we train with AdamW using the default LR 0.0006, a short 400-iteration LR warmup, gradient clipping with the ℓ2-threshold 1.0, and 10× cosine LR decay. (See the configuration sketch below the table.) |
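
Since the paper expresses its weight-decay updates only as equations (Eq. 1, Eq. 2, Eq. 4 in the text, not reproduced here), the following minimal sketch illustrates the two standard update rules such equations typically distinguish: coupled L2 regularization added to the gradient, and decoupled (AdamW-style) weight decay applied directly to the weights. The function names and scalar formulation are illustrative and are not taken from the paper or its code.

```python
import torch

def sgd_with_l2(param: torch.Tensor, grad: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """Coupled weight decay: the L2 penalty wd * param is added to the
    gradient before a plain SGD step."""
    return param - lr * (grad + wd * param)

def decoupled_weight_decay_step(param: torch.Tensor, update: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """Decoupled (AdamW-style) weight decay: the weights are shrunk by
    lr * wd * param separately from the (possibly adaptive) gradient update."""
    return param - lr * update - lr * wd * param
```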
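
The GPT-2-Small setup quoted in the "Experiment Setup" row maps onto a standard PyTorch training loop. The sketch below mirrors the stated hyperparameters (AdamW, peak LR 0.0006, 400-iteration linear warmup, cosine LR decay, gradient clipping at an ℓ2-threshold of 1.0, batch size 256, 50 000 iterations). The weight-decay value, the cosine floor of LR/10, and the stand-in linear model with synthetic data are assumptions for illustration only; they are not specified in the quoted text.

```python
import math
import torch

# Hyperparameters from the quoted setup; WEIGHT_DECAY and MIN_LR_FACTOR are
# illustrative assumptions, not values taken from the quoted text.
MAX_ITERS, WARMUP_ITERS, BATCH_SIZE = 50_000, 400, 256
PEAK_LR, MIN_LR_FACTOR, WEIGHT_DECAY = 6e-4, 0.1, 0.1

model = torch.nn.Linear(128, 10)  # stand-in for the 124M-parameter GPT-2-Small
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 400 iterations, then cosine decay of the LR.
    if step < WARMUP_ITERS:
        return (step + 1) / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / max(1, MAX_ITERS - WARMUP_ITERS)
    return MIN_LR_FACTOR + (1.0 - MIN_LR_FACTOR) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(MAX_ITERS):
    x = torch.randn(BATCH_SIZE, 128)           # synthetic batch in place of OpenWebText
    y = torch.randint(0, 10, (BATCH_SIZE,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # ℓ2-threshold 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```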