Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Authors: Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, Kyungwoo Song

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our theorem and its practical implementation. We first validate our new error bounds with synthetic datasets to show that the bounds hold empirically. Then, we evaluate our method by conducting extensive experiments of fine-tuning CLIP [47] on the ImageNet-1K [10] classification task under natural distribution shift (ImageNet-V2/R/A/Sketch and ObjectNet) and synthetic distribution shift (ImageNet-C).
Researcher Affiliation Collaboration Changdae Oh (University of Wisconsin-Madison), Hyesu Lim (KAIST AI), Mijoo Kim (Chung-Ang University), Dongyoon Han (NAVER AI Lab), Sangdoo Yun (NAVER AI Lab), Jaegul Choo (KAIST AI), Alexander Hauptmann (Carnegie Mellon University), Zhi-Qi Cheng (Carnegie Mellon University), Kyungwoo Song (Yonsei University)
Pseudocode No The paper includes diagrams (e.g., Figure 2) and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code Yes Our code is available here.
Open Datasets Yes We first validate our new error bounds with synthetic datasets to show that the bounds hold empirically. Then, we evaluate our method by conducting extensive experiments of fine-tuning CLIP [47] on the ImageNet-1K [10] classification task under natural distribution shift (ImageNet-V2/R/A/Sketch and ObjectNet) and synthetic distribution shift (ImageNet-C). For downstream tasks, we consider ImageNet-1K (IN) classification and regard it as our ID domain. We consider IN-V2 [49], IN-R [23], IN-A [24], IN-S [60], and ObjectNet [3] as natural shifts of the in-distribution dataset (IN).
Dataset Splits Yes The temperature value was tuned for each method on the ID validation set based on the ECE value. We generate a binary classification dataset with 1000-dimensional Gaussian random variables as features, where the means of the features are partly shifted across different test environments (ID, OOD). For the ID train set, the first 400 dimensions and the second 400 dimensions are correlated with the labels, and the remaining 200 dimensions are zero-centered random noise.
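A minimal sketch of this kind of synthetic ID/OOD split, assuming NumPy. The 1000 feature dimensions, the 800 label-correlated dimensions, and the 200 noise dimensions follow the description above; the correlation strength `rho`, the shift magnitude, and which dimensions get shifted are assumptions for illustration.

```python
import numpy as np

def make_split(n, shift=0.0, rho=0.5, seed=0):
    """Binary classification data with 1000-dim Gaussian features.

    The first 800 dims are label-correlated (strength `rho`, assumed);
    the last 200 dims are zero-centered noise. `shift` perturbs the
    means of a subset of features to simulate an OOD test environment.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)            # labels in {0, 1}
    x = rng.normal(size=(n, 1000))            # base Gaussian features
    signal = rho * (2 * y - 1)[:, None]       # +rho for class 1, -rho for class 0
    x[:, :800] += signal                      # first 400 + second 400 dims correlate with y
    x[:, :400] += shift                       # partial mean shift (OOD environment)
    return x, y

x_id, y_id = make_split(1000, shift=0.0)              # ID environment
x_ood, y_ood = make_split(1000, shift=1.0, seed=1)    # OOD environment with shifted means
```

Varying `shift` across several OOD environments then lets one check empirically whether the paper's error bounds hold as the distribution moves away from the ID data.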
Hardware Specification No The paper mentions 'Due to resource constraints, our research reached the scale of ViT-L.' and 'Some parts of experiments are based on the NAVER Smart Machine Learning (NSML) platform [54].' but does not specify any particular GPU models, CPU types, or detailed computer specifications used for the experiments.
Software Dependencies No The paper mentions optimizing models using 'AdamW' as the optimizer but does not specify any software versions for frameworks (e.g., PyTorch, TensorFlow) or other libraries used in the implementation.
Experiment Setup Yes For all methods, we optimize the model parameters using AdamW with a batch size of 512 over 10 epochs. We set the orthogonality constraint coefficient λOC as 0.2, the self-distillation coefficient λSD as 1.5, the update frequency for the EMA teacher as 500, and the EMA final target momentum as 0.9. We linearly increased the EMA momentum α by 0.05 over the first 20% of iterations.
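The EMA-teacher schedule above can be sketched as follows, assuming NumPy and toy parameter vectors in place of model weights. The 0.9 final momentum, the 500-step update frequency, and the 20% warmup window come from the quoted setup; the initial momentum `alpha_init` and the exact shape of the linear ramp are assumptions.

```python
import numpy as np

def momentum_schedule(step, total_steps, alpha_init=0.5,
                      alpha_final=0.9, warmup_frac=0.2):
    """Linearly ramp the EMA momentum alpha over the first 20% of
    iterations, then hold it at the 0.9 target (alpha_init and the
    exact ramp shape are assumptions)."""
    warmup = int(warmup_frac * total_steps)
    if step >= warmup:
        return alpha_final
    return alpha_init + (alpha_final - alpha_init) * step / warmup

def ema_update(teacher, student, alpha):
    """teacher <- alpha * teacher + (1 - alpha) * student."""
    return alpha * teacher + (1.0 - alpha) * student

# Toy parameter vectors standing in for teacher/student model weights.
teacher = np.zeros(4)
student = np.ones(4)
total_steps, update_freq = 5000, 500   # teacher updated every 500 steps
for step in range(total_steps):
    # ... one AdamW optimization step on `student` would happen here ...
    if step % update_freq == 0:
        alpha = momentum_schedule(step, total_steps)
        teacher = ema_update(teacher, student, alpha)
```

A low early momentum lets the teacher track the student quickly at the start of fine-tuning, while the higher final momentum stabilizes the self-distillation target late in training.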