Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Authors: Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results and in-depth visualizations on CLIP show that such an intervention can effectively i) improve the model's accuracy when spurious attributes are not present, and ii) direct the model's activation maps towards the actual class rather than the spurious attribute when present. In particular, on the Waterbirds dataset, our algorithm achieved a worst-group accuracy 23% higher than ERM on CLIP with a ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while maintaining the same average accuracy as ERM. (A sketch of the worst-group accuracy metric appears after the table.) |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of California, Los Angeles, USA; (2) Microsoft Research, Redmond, USA. |
| Pseudocode | No | The paper describes mathematical formulations of loss functions but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/bigml-cs-ucla/clipspurious-finetune |
| Open Datasets | Yes | Waterbirds (Sagawa et al., 2019) is the most commonly used benchmark dataset for studying spurious correlations. It combines birds segmented from the CUB dataset (Wah et al., 2011) with backgrounds from the Places dataset (Zhou et al., 2017) in an imbalanced way, such that the background can be used as a spurious attribute for bird classification. ImageNet-1K: Singla et al. (2021) found that some features are spuriously correlated with some categories in ImageNet-1K (Russakovsky et al., 2015). |
| Dataset Splits | Yes | Note that since the validation set for ImageNet contains only 50 images per class, we run the spurious correlation detection and evaluation stages on the training data instead, while mitigation results are presented for the test data. ...Both average and worst-group performance are evaluated with models early stopped at the highest worst-group accuracy on the validation set. (A sketch of this model-selection rule appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models; it refers to hardware only in general terms. |
| Software Dependencies | No | The paper mentions software components by name but does not specify version numbers or provide a full list of dependencies. |
| Experiment Setup | Yes | We used the SGD optimizer for all the experiments, and tuned the learning rates and weight decays for ERM, Group DRO, and the CLIP-based losses (CLIP fine-tuning and our method) separately. Our method uses learning rate 1e-5 with weight decay 1e-4. (A sketch of this optimizer configuration appears after the table.) |
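
The worst-group accuracy quoted in the Research Type row is the minimum per-group accuracy, where a group is a (class, spurious attribute) combination; Waterbirds, for example, has four groups: {landbird, waterbird} × {land, water background}. Below is a minimal sketch of the metric, assuming NumPy arrays of predictions, labels, and group indices; the function name and signature are illustrative, not taken from the paper's code.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Return (worst-group accuracy, average accuracy).

    preds, labels, groups: 1-D integer arrays of equal length.
    `groups` indexes (class, spurious attribute) combinations,
    e.g. 4 groups for Waterbirds.
    """
    correct = preds == labels
    # Accuracy within each group; the worst one is the reported metric.
    group_accs = [correct[groups == g].mean() for g in np.unique(groups)]
    return min(group_accs), correct.mean()
```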
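The model-selection rule quoted in the Dataset Splits row (early stopping at the highest validation worst-group accuracy) can be sketched as follows. Here `train_fn` and `eval_fn` are hypothetical callables standing in for one fine-tuning epoch and a validation pass; they are not part of the paper's code.

```python
import copy

def select_best_checkpoint(model, train_fn, eval_fn, num_epochs):
    """Early-stop at the epoch with the highest validation
    worst-group accuracy, then restore that checkpoint.

    train_fn(model): runs one epoch of fine-tuning (hypothetical).
    eval_fn(model): returns validation worst-group accuracy (hypothetical).
    """
    best_wga, best_state = float("-inf"), None
    for _ in range(num_epochs):
        train_fn(model)
        wga = eval_fn(model)
        if wga > best_wga:  # new best validation worst-group accuracy
            best_wga = wga
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # the "early stopped" model
    return model, best_wga
```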
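The Experiment Setup row reports SGD with learning rate 1e-5 and weight decay 1e-4 for the authors' method. A minimal PyTorch sketch of that configuration; the helper name is ours, and only the hyperparameter values come from the paper.

```python
import torch

# Hyperparameter values reported in the paper for the authors' method.
LEARNING_RATE = 1e-5
WEIGHT_DECAY = 1e-4

def make_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    """SGD optimizer with the reported fine-tuning hyperparameters."""
    return torch.optim.SGD(
        model.parameters(),
        lr=LEARNING_RATE,
        weight_decay=WEIGHT_DECAY,
    )
```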