Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
Authors: Junjiao Tian, Chengyue Huang, Zsolt Kira
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks. |
| Researcher Affiliation | Academia | Junjiao Tian Georgia Institute of Technology EMAIL Chengyue Huang Georgia Institute of Technology EMAIL Zsolt Kira Georgia Institute of Technology EMAIL |
| Pseudocode | Yes | Alg. 1 shows the Adam optimizer with the L2-SP regularization in Eq. 1. The effects of the regularization are highlighted in blue, also shown in Eq. 2. Alg. 2 shows the proposed SPD. |
| Open Source Code | Yes | Code available at https://github.com/GT-RIPL/Selective-Projection-Decay.git. |
| Open Datasets | Yes | Image Classification. We first analyze the behavior of SPD on conventional image classification datasets DomainNet [22] and ImageNet [23]. Semantic Segmentation. We further test SPD on the PASCAL-Context semantic segmentation dataset [29]. Common Sense Reasoning. Moreover, we show that SPD can benefit PEFT fine-tuning on large language models (LLMs). We use the Commonsense-170K dataset [34]. Visual Question Answering. Finally, we demonstrate SPD's superiority on multi-modal task. We use Google's recently released PaliGemma [36] pretrained on a broad mixture of large-scale vision-language tasks. We fine-tune on VQAv2 [37] and test on nine OOD datasets using LoRA [7]. |
| Dataset Splits | Yes | The regularization hyper-parameter is found through cross-validation, and the model with the best ID validation accuracy is taken. |
| Hardware Specification | Yes | We use 1 A40 GPU for each experiment. ... We use 2 A40 GPUs for each experiment. ... We use 4 2080Ti GPUs for each experiment. ... We use 1 A40 GPU for each experiment. ... We use 8 A40 GPU for each experiment. |
| Software Dependencies | No | The paper mentions using specific external repositories for training code (e.g., DEIT [46], prior work [30], prior work [34], LAVIS [56]) and standard augmentations (Mixup, Cutmix), along with optimizers (AdamW), but does not specify version numbers for these software components or underlying libraries like PyTorch/TensorFlow. |
| Experiment Setup | Yes | Standard augmentations are used for all: weight-decay (0.1), drop-path (0.2) [52], label-smoothing (0.1) [53], Mixup (0.8) [54] and Cutmix (1.0) [55]. The learning rate is 2e-5 and trained for 60 epochs for Tab. 1 and 30 epochs for Tab. 2. We use λ = 1 for all Adam-SPD results in Tab. 1. |
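
The Pseudocode row refers to L2-SP regularization, which in its commonly used form penalizes the squared distance between the fine-tuned weights and the pretrained weights. The following is a minimal sketch of that generic penalty and its gradient only. It is not the paper's Alg. 1 or Alg. 2, and the paper's Eq. 1 may use different constants or notation. The function names `l2sp_penalty` and `l2sp_gradient` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def l2sp_penalty(params: Sequence[float],
                 pretrained: Sequence[float],
                 lam: float) -> float:
    """Generic L2-SP penalty: (lam / 2) * ||params - pretrained||^2.

    Assumption: this is the textbook form of L2-SP, not a
    transcription of the paper's Eq. 1.
    """
    return 0.5 * lam * sum((p - p0) ** 2 for p, p0 in zip(params, pretrained))


def l2sp_gradient(params: Sequence[float],
                  pretrained: Sequence[float],
                  lam: float) -> List[float]:
    """Gradient of the penalty above with respect to params.

    Each component is lam * (p - p0), so the term pulls every weight
    back toward its pretrained value.
    """
    return [lam * (p - p0) for p, p0 in zip(params, pretrained)]


if __name__ == "__main__":
    theta = [1.0, 2.0]     # current (fine-tuned) weights, illustrative values
    theta_0 = [0.0, 0.0]   # pretrained weights, illustrative values
    print(l2sp_penalty(theta, theta_0, lam=1.0))
    print(l2sp_gradient(theta, theta_0, lam=1.0))
```

In an optimizer such as Adam, this gradient term would typically be added to the task-loss gradient before the update. How SPD applies the penalty selectively is specified in the paper's Alg. 2 and is not reproduced here.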