Robust Fine-tuning of Zero-shot Models via Variance Reduction
Authors: Beier Zhu, Jiequan Cui, Hanwang Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5–2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9–3.1 pp) on other distribution shift benchmarks. |
| Researcher Affiliation | Academia | Beier Zhu, Jiequan Cui, Hanwang Zhang, Nanyang Technological University; beier002@e.ntu.edu.sg, hanwangzhang@ntu.edu.sg |
| Pseudocode | Yes | Algorithm 1: Variance Reduction Fine-tuning |
| Open Source Code | Yes | Codes are available in https://github.com/BeierZhu/VRF. |
| Open Datasets | Yes | CIFAR-10 [13]: MIT License, https://www.cs.toronto.edu/~kriz/cifar.html. STL-10 [2]: Non-commercial, https://cs.stanford.edu/~acoates/stl10/. Entity-30 [23]: Non-commercial, https://github.com/MadryLab/BREEDS-Benchmarks. ImageNet [3]: Non-commercial, http://image-net.org. IN-V2 [21]: MIT License, https://github.com/modestyachts/ImageNetV2. IN-R [7]: MIT License, https://github.com/hendrycks/imagenet-r. IN-Sketch [27]: MIT License, https://github.com/HaohanWang/ImageNet-Sketch. IN-A [9]: MIT License, https://github.com/hendrycks/natural-adv-examples. ObjectNet [1]: Creative Commons Attribution 4.0, https://objectnet.dev. |
| Dataset Splits | Yes | Note that all the hyperparameters, e.g., α, a, b, are searched using the accuracy on the in-distribution (ID) validation set. Derived distribution shift datasets are only for evaluation and not for hyperparameter sweeps. |
| Hardware Specification | Yes | The batch size for training CLIP ViT-B/16 based LP-FT models is set to 384, which is the largest batch size that fits into 2 A6000 GPUs. |
| Software Dependencies | No | When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with a weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3 × 10⁻⁵ and gradient clipping at norm 1. When fine-tuning LP-FT, we first adopt the settings of Wortsman et al. [28] to train the linear classifier, then fully fine-tune the model at a learning rate of 1 × 10⁻⁵. For efficient k-NN search, we use the Faiss library [11] (a usage sketch follows the table). No specific version numbers for PyTorch, AdamW, or Faiss are explicitly stated. |
| Experiment Setup | Yes | When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with a weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3 × 10⁻⁵ and gradient clipping at norm 1. A hedged sketch of this optimizer and schedule configuration follows the table. |
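
To make the reported E2E-FT recipe concrete, the following is a minimal sketch of a fine-tuning loop under the stated hyperparameters (PyTorch AdamW, 10 epochs, weight decay 0.1, learning rate 3 × 10⁻⁵, cosine annealing with 500 warm-up steps, gradient clipping at norm 1). The `cosine_lr_with_warmup` helper, the model, and the data loader are illustrative assumptions, not code from the authors' repository.

```python
# Sketch of the reported E2E-FT optimization setup; names and structure are
# assumptions, only the hyperparameter values come from the paper's text.
import math
import torch

def cosine_lr_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warm-up followed by cosine decay, expressed as a LambdaLR schedule."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def fine_tune(model, train_loader, epochs=10, lr=3e-5, weight_decay=0.1,
              warmup_steps=500, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = cosine_lr_with_warmup(optimizer, warmup_steps, epochs * len(train_loader))
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping at norm 1, as stated in the experiment setup.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
    return model
```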
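
The paper also reports using the Faiss library for efficient k-NN search. The snippet below is a hedged sketch of such a search over in-distribution training features; the exact-L2 `IndexFlatL2` index, the feature arrays, and the `knn_distances` helper are assumptions for illustration, not the authors' implementation.

```python
# Minimal k-NN search sketch with Faiss; index type and inputs are assumptions.
import numpy as np
import faiss

def knn_distances(train_features, query_features, k=5):
    """Return (squared L2) distances to the k nearest ID training features per query."""
    d = train_features.shape[1]
    index = faiss.IndexFlatL2(d)                  # exact L2 search over ID features
    index.add(train_features.astype(np.float32))  # database: fine-tuning features
    dists, _ = index.search(query_features.astype(np.float32), k)
    return dists                                  # shape: (num_queries, k)
```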