Robust Fine-tuning of Zero-shot Models via Variance Reduction

Authors: Beier Zhu, Jiequan Cui, Hanwang Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5–2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9–3.1 pp) on other distribution shift benchmarks.
Researcher Affiliation | Academia | Beier Zhu, Jiequan Cui, Hanwang Zhang, Nanyang Technological University; beier002@e.ntu.edu.sg, hanwangzhang@ntu.edu.sg
Pseudocode | Yes | Algorithm 1: Variance Reduction Fine-tuning
Open Source Code | Yes | Code is available at https://github.com/BeierZhu/VRF.
Open Datasets | Yes | CIFAR-10 [13]: MIT License, https://www.cs.toronto.edu/~kriz/cifar.html. STL-10 [2]: Non-commercial, https://cs.stanford.edu/~acoates/stl10/. Entity-30 [23]: Non-commercial, https://github.com/MadryLab/BREEDS-Benchmarks. ImageNet [3]: Non-commercial, http://image-net.org. IN-V2 [21]: MIT License, https://github.com/modestyachts/ImageNetV2. IN-R [7]: MIT License, https://github.com/hendrycks/imagenet-r. IN-Sketch [27]: MIT License, https://github.com/HaohanWang/ImageNet-Sketch. IN-A [9]: MIT License, https://github.com/hendrycks/natural-adv-examples. ObjectNet [1]: Creative Commons Attribution 4.0, https://objectnet.dev.
Dataset Splits | Yes | Note that all the hyperparameters, e.g., α, a, b, are searched using the accuracy on the in-distribution (ID) validation set. Derived distribution-shift datasets are used only for evaluation, not for hyperparameter sweeps (see the search-protocol sketch after the table).
Hardware Specification | Yes | The batch size for training CLIP ViT-16 based LP-FT models is set to 384, which is the largest batch size that fits into 2 A6000 GPUs.
Software Dependencies | No | When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3 × 10⁻⁵ and gradient clipping at norm 1. When fine-tuning LP-FT, we first adopt the settings of Wortsman et al. [28] to train the linear classifier, then fully fine-tune the model at a learning rate of 1 × 10⁻⁵. For efficient k-NN search, we use the Faiss library [11] (see the Faiss sketch after the table). No specific version numbers for PyTorch or Faiss are explicitly stated.
Experiment Setup | Yes | When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3 × 10⁻⁵ and gradient clipping at norm 1 (see the training-configuration sketch after the table).
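
The training configuration quoted in the Software Dependencies and Experiment Setup rows is concrete enough to express in code. Below is a minimal PyTorch sketch, assuming a standard linear-warmup-plus-cosine schedule and a generic training loop; the helper names (`make_optimizer_and_scheduler`, `train`) and the exact warm-up handling are illustrative assumptions, not the authors' implementation.

```python
import math
import torch

# Hyperparameters quoted in the report (E2E-FT setting of Wortsman et al. [28]).
LR = 3e-5            # learning rate
WEIGHT_DECAY = 0.1   # AdamW weight decay
EPOCHS = 10
WARMUP_STEPS = 500   # linear warm-up before cosine annealing
CLIP_NORM = 1.0      # gradient clipping at norm 1

def make_optimizer_and_scheduler(model, steps_per_epoch):
    """AdamW plus linear warm-up followed by cosine annealing (illustrative)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
    total_steps = EPOCHS * steps_per_epoch

    def lr_lambda(step):
        if step < WARMUP_STEPS:
            return step / max(1, WARMUP_STEPS)
        progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def train(model, loader, loss_fn, device="cuda"):
    optimizer, scheduler = make_optimizer_and_scheduler(model, len(loader))
    model.train()
    for _ in range(EPOCHS):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            # Clip gradients at norm 1, as stated in the quoted setup.
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
            optimizer.step()
            scheduler.step()
```

The batch size (384 per the Hardware Specification row) would be set on the DataLoader; the report describes it as a memory-bound choice for 2 A6000 GPUs in the LP-FT setting rather than a tuned value.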
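
The Faiss dependency is used for k-NN search. A minimal sketch of an exact L2 k-NN lookup with Faiss, assuming pre-extracted float32 feature matrices; `train_features` and `query_features` are placeholder names, and the report does not specify which Faiss index type the authors use.

```python
import faiss
import numpy as np

def knn_search(train_features: np.ndarray, query_features: np.ndarray, k: int = 5):
    """Exact L2 k-NN over feature vectors with Faiss (illustrative)."""
    d = train_features.shape[1]                   # feature dimensionality
    index = faiss.IndexFlatL2(d)                  # exact (non-approximate) L2 index
    index.add(train_features.astype(np.float32))  # index the reference features
    distances, indices = index.search(query_features.astype(np.float32), k)
    return distances, indices                     # both of shape (n_queries, k)
```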
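
The Dataset Splits row describes the tuning protocol: hyperparameters such as α, a, b are selected by ID validation accuracy only, with the derived distribution shifts held out. A generic grid-search sketch of that protocol, assuming a hypothetical `evaluate_id_accuracy(alpha, a, b)` callback and placeholder grids (not the authors' search space):

```python
import itertools

def search_hparams(evaluate_id_accuracy, alphas, a_grid, b_grid):
    """Select (alpha, a, b) by in-distribution validation accuracy only (illustrative)."""
    best_acc, best_cfg = -1.0, None
    for alpha, a, b in itertools.product(alphas, a_grid, b_grid):
        acc = evaluate_id_accuracy(alpha, a, b)   # accuracy on the ID validation set
        if acc > best_acc:
            best_acc, best_cfg = acc, (alpha, a, b)
    return best_cfg, best_acc                     # OOD sets are touched only at test time
```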