Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
Authors: Junjiao Tian, Chengyue Huang, Zsolt Kira
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks. |
| Researcher Affiliation | Academia | Junjiao Tian, Georgia Institute of Technology, jtian73@gatech.edu; Chengyue Huang, Georgia Institute of Technology, chuang475@gatech.edu; Zsolt Kira, Georgia Institute of Technology, zkira@gatech.edu |
| Pseudocode | Yes | Alg. 1 shows the Adam optimizer with the L2-SP regularization in Eq. 1. The effects of the regularization are highlighted in blue, also shown in Eq. 2. Alg. 2 shows the proposed SPD. (A hedged code sketch of this style of regularization appears below the table.) |
| Open Source Code | Yes | Code available at https://github.com/GT-RIPL/Selective-Projection-Decay.git. |
| Open Datasets | Yes | Image Classification. We first analyze the behavior of SPD on conventional image classification datasets DomainNet [22] and ImageNet [23]. Semantic Segmentation. We further test SPD on the PASCAL-Context semantic segmentation dataset [29]. Common Sense Reasoning. Moreover, we show that SPD can benefit PEFT fine-tuning on large language models (LLMs). We use the Commonsense-170K dataset [34]. Visual Question Answering. Finally, we demonstrate SPD's superiority on multi-modal tasks. We use Google's recently released PaliGemma [36], pretrained on a broad mixture of large-scale vision-language tasks. We fine-tune on VQAv2 [37] and test on nine OOD datasets using LoRA [7]. |
| Dataset Splits | Yes | The regularization hyper-parameter is found through cross-validation, and the model with the best ID validation accuracy is taken. |
| Hardware Specification | Yes | We use 1 A40 GPU for each experiment. ... We use 2 A40 GPUs for each experiment. ... We use 4 2080Ti GPUs for each experiment. ... We use 1 A40 GPU for each experiment. ... We use 8 A40 GPUs for each experiment. |
| Software Dependencies | No | The paper mentions using specific external repositories for training code (e.g., DEIT [46], prior work [30], prior work [34], LAVIS [56]) and standard augmentations (Mixup, Cutmix), along with optimizers (AdamW), but does not specify version numbers for these software components or underlying libraries like PyTorch/TensorFlow. |
| Experiment Setup | Yes | Standard augmentations are used for all: weight-decay (0.1), drop-path (0.2) [52], label-smoothing (0.1) [53], Mixup (0.8) [54] and Cutmix (1.0) [55]. The learning rate is 2e-5, and models are trained for 60 epochs for Tab. 1 and 30 epochs for Tab. 2. We use λ = 1 for all Adam-SPD results in Tab. 1. (These values are collected into an illustrative config below the table.) |
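
The pseudocode row refers to Adam with L2-SP regularization (Alg. 1 / Eq. 1) and the proposed Selective Projection Decay (Alg. 2). Below is a minimal, hedged PyTorch sketch of how an L2-SP-style penalty toward the pre-trained weights could be added to an Adam fine-tuning loop; the model, toy data, and `lam` value are illustrative placeholders, and the paper's selective per-layer criterion (Alg. 2) is not reproduced here.

```python
import torch

# Hypothetical sketch of L2-SP-style regularization around an Adam update.
# The model, data, and lam are illustrative stand-ins, not the authors' exact Alg. 1/2.
model = torch.nn.Linear(16, 2)  # stand-in for a pre-trained backbone
theta_0 = {n: p.detach().clone() for n, p in model.named_parameters()}  # frozen pre-trained weights
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def l2_sp_penalty(model, theta_0, lam=1.0):
    # lam/2 * ||theta - theta_0||^2 summed over all fine-tuned parameters
    return 0.5 * lam * sum(
        (p - theta_0[n]).pow(2).sum() for n, p in model.named_parameters()
    )

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))  # toy batch
for _ in range(10):
    loss = torch.nn.functional.cross_entropy(model(x), y) + l2_sp_penalty(model, theta_0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```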
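
For quick reference, the hyper-parameters quoted in the Experiment Setup row can be gathered into a single illustrative config. The dictionary keys and structure below are hypothetical; only the values come from the paper.

```python
# Illustrative config assembling the reported hyper-parameters; key names are hypothetical.
experiment_setup = {
    "weight_decay": 0.1,
    "drop_path": 0.2,        # [52]
    "label_smoothing": 0.1,  # [53]
    "mixup": 0.8,            # [54]
    "cutmix": 1.0,           # [55]
    "learning_rate": 2e-5,
    "epochs": {"Tab. 1": 60, "Tab. 2": 30},
    "spd_lambda": 1.0,       # λ = 1 for all Adam-SPD results in Tab. 1
}
```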