Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Authors: Junjiao Tian, Chengyue Huang, Zsolt Kira

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks.
Researcher Affiliation | Academia | Junjiao Tian (Georgia Institute of Technology, jtian73@gatech.edu); Chengyue Huang (Georgia Institute of Technology, chuang475@gatech.edu); Zsolt Kira (Georgia Institute of Technology, zkira@gatech.edu)
Pseudocode | Yes | Alg. 1 shows the Adam optimizer with the L2-SP regularization in Eq. 1. The effects of the regularization are highlighted in blue, also shown in Eq. 2. Alg. 2 shows the proposed SPD. (A hedged sketch of an L2-SP-style penalty appears after the table.)
Open Source Code | Yes | Code available at https://github.com/GT-RIPL/Selective-Projection-Decay.git.
Open Datasets | Yes | Image Classification: We first analyze the behavior of SPD on the conventional image classification datasets DomainNet [22] and ImageNet [23]. Semantic Segmentation: We further test SPD on the PASCAL-Context semantic segmentation dataset [29]. Commonsense Reasoning: Moreover, we show that SPD can benefit PEFT fine-tuning of large language models (LLMs); we use the Commonsense-170K dataset [34]. Visual Question Answering: Finally, we demonstrate SPD's superiority on a multi-modal task. We use Google's recently released PaliGemma [36], pretrained on a broad mixture of large-scale vision-language tasks, fine-tune on VQAv2 [37], and test on nine OOD datasets using LoRA [7]. (A hedged LoRA setup sketch appears after the table.)
Dataset Splits | Yes | The regularization hyper-parameter is found through cross-validation, and the model with the best ID validation accuracy is taken.
Hardware Specification | Yes | We use 1 A40 GPU for each experiment. ... We use 2 A40 GPUs for each experiment. ... We use 4 2080Ti GPUs for each experiment. ... We use 1 A40 GPU for each experiment. ... We use 8 A40 GPUs for each experiment.
Software Dependencies | No | The paper mentions using specific external repositories for training code (e.g., DeiT [46], prior work [30], prior work [34], LAVIS [56]) and standard augmentations (Mixup, CutMix), along with optimizers (AdamW), but does not specify version numbers for these software components or for underlying libraries such as PyTorch/TensorFlow.
Experiment Setup | Yes | Standard augmentations and regularizers are used for all: weight decay (0.1), drop-path (0.2) [52], label smoothing (0.1) [53], Mixup (0.8) [54], and CutMix (1.0) [55]. The learning rate is 2e-5; models are trained for 60 epochs for Tab. 1 and 30 epochs for Tab. 2. We use λ = 1 for all Adam-SPD results in Tab. 1. (A hedged training-setup sketch appears after the table.)
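
For reference, below is a minimal PyTorch sketch of an L2-SP-style penalty of the kind Alg. 1 attaches to Adam: it pulls fine-tuned weights back toward their pretrained starting point. The function name, the alpha coefficient, and the loop structure are illustrative assumptions, not the paper's Eq. 1/Eq. 2; the selective per-layer rule of Alg. 2 (SPD) is implemented in the linked repository.

```python
import torch

def l2_sp_penalty(model, pretrained_state, alpha=0.01):
    """L2-SP-style regularizer: penalize the squared distance of each parameter
    from its pretrained value (the 'starting point'). alpha is a placeholder
    coefficient, not a value taken from the paper."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if param.requires_grad and name in pretrained_state:
            ref = pretrained_state[name].to(param.device)
            penalty = penalty + (param - ref).pow(2).sum()
    return 0.5 * alpha * penalty

# Usage: cache the pretrained weights once, then add the penalty to the task loss.
# pretrained_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
# loss = criterion(model(x), y) + l2_sp_penalty(model, pretrained_state)
```

SPD (Alg. 2) differs in that it applies such a decay selectively across layers rather than uniformly; see the repository for the exact selection rule.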
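The LoRA fine-tuning mentioned for the Commonsense-170K and VQAv2 experiments can be reproduced in spirit with Hugging Face peft; the sketch below is a generic setup, not the paper's configuration. The checkpoint name, rank, alpha, dropout, and target modules are all assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; the paper fine-tunes LLMs on Commonsense-170K and
# PaliGemma on VQAv2, and the exact checkpoints are not listed in this section.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                 # rank: assumed, not from the paper
    lora_alpha=32,                        # scaling: assumed
    lora_dropout=0.05,                    # assumed
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```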
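The image-classification setup in the last row maps onto standard timm/PyTorch components; the sketch below wires the reported values (weight decay 0.1, drop-path 0.2, label smoothing 0.1, Mixup 0.8, CutMix 1.0, learning rate 2e-5) into a plain AdamW baseline. The backbone name is an assumption, and the paper's Adam-SPD optimizer (λ = 1) comes from the linked repository rather than this snippet.

```python
import timm
import torch
from timm.data import Mixup

# Backbone with stochastic depth (drop-path) 0.2; the architecture name is an
# assumption, since this section does not state which variant is used.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          drop_path_rate=0.2, num_classes=345)  # 345 = DomainNet classes

# Mixup (0.8), CutMix (1.0), and label smoothing (0.1), matching the reported values.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=345)

# Plain AdamW stand-in with the reported lr 2e-5 and weight decay 0.1; the
# paper's Adam-SPD (with lambda = 1) replaces this optimizer in practice.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
```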