Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Authors: Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results and in-depth visualizations on CLIP show that such an intervention can effectively i) improve the model's accuracy when spurious attributes are not present, and ii) direct the model's activation maps towards the actual class rather than the spurious attribute when present. In particular, on the Waterbirds dataset, our algorithm achieved a worst-group accuracy 23% higher than ERM on CLIP with a ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while maintaining the same average accuracy as ERM. (A sketch of the worst-group accuracy metric appears after the table.) |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of California, Los Angeles, USA; (2) Microsoft Research, Redmond, USA. |
| Pseudocode | No | The paper describes mathematical formulations of loss functions but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/bigml-cs-ucla/clipspurious-finetune |
| Open Datasets | Yes | Waterbirds (Sagawa et al., 2019) is the most commonly used benchmark dataset for studying spurious correlations. It combines birds segmented from the CUB dataset (Wah et al., 2011) with backgrounds from the Places dataset (Zhou et al., 2017) in an imbalanced way, such that the background can be used as a spurious attribute for bird classification. ImageNet-1K: Singla et al. (2021) found that some features are spuriously correlated with some categories in ImageNet-1K (Russakovsky et al., 2015). |
| Dataset Splits | Yes | Note that since the validation set for ImageNet contains only 50 images per class, we run the spurious correlation detection and evaluation stages on the training data instead, while mitigation results are presented for the test data. ...Both average and worst-group performance are evaluated with models early stopped at the highest worst-group accuracy on the validation set. (A sketch of this model-selection rule appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models; it refers to hardware only in general terms. |
| Software Dependencies | No | The paper mentions software components by name but does not specify version numbers or provide a full list of dependencies. |
| Experiment Setup | Yes | We used the SGD optimizer for all the experiments, and tuned the learning rates and weight decays for ERM, Group DRO, and the CLIP-based losses (CLIP fine-tuning and our method) separately. Our method uses learning rate 1e-5 with weight decay 1e-4. (A sketch of this optimizer configuration appears after the table.) |
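
The worst-group accuracy quoted in the Research Type row is the minimum per-group accuracy, where a group is a (class, spurious attribute) combination; Waterbirds, for example, has four groups: {landbird, waterbird} × {land, water background}. Below is a minimal sketch of the metric, assuming NumPy arrays of predictions, labels, and group indices; the function name and signature are illustrative, not taken from the paper's code.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Return (worst-group accuracy, average accuracy).

    preds, labels, groups: 1-D integer arrays of equal length.
    `groups` indexes (class, spurious attribute) combinations,
    e.g. 4 groups for Waterbirds.
    """
    correct = preds == labels
    # Accuracy within each group; the worst one is the reported metric.
    group_accs = [correct[groups == g].mean() for g in np.unique(groups)]
    return min(group_accs), correct.mean()
```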
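The model-selection rule quoted in the Dataset Splits row (early stopping at the highest validation worst-group accuracy) can be sketched as follows. Here `train_fn` and `eval_fn` are hypothetical callables standing in for one fine-tuning epoch and a validation pass; they are not part of the paper's code.

```python
import copy

def select_best_checkpoint(model, train_fn, eval_fn, num_epochs):
    """Early-stop at the epoch with the highest validation
    worst-group accuracy, then restore that checkpoint.

    train_fn(model): runs one epoch of fine-tuning (hypothetical).
    eval_fn(model): returns validation worst-group accuracy (hypothetical).
    """
    best_wga, best_state = float("-inf"), None
    for _ in range(num_epochs):
        train_fn(model)
        wga = eval_fn(model)
        if wga > best_wga:  # new best validation worst-group accuracy
            best_wga = wga
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # the "early stopped" model
    return model, best_wga
```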
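The Experiment Setup row reports SGD with learning rate 1e-5 and weight decay 1e-4 for the authors' method. A minimal PyTorch sketch of that configuration; the helper name is ours, and only the hyperparameter values come from the paper.

```python
import torch

# Hyperparameter values reported in the paper for the authors' method.
LEARNING_RATE = 1e-5
WEIGHT_DECAY = 1e-4

def make_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    """SGD optimizer with the reported fine-tuning hyperparameters."""
    return torch.optim.SGD(
        model.parameters(),
        lr=LEARNING_RATE,
        weight_decay=WEIGHT_DECAY,
    )
```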