Zero-Shot Robustification of Zero-Shot Models

Authors: Dyah Adila, Changho Shin, Linrong Cai, Frederic Sala

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate ROBOSHOT on nine image and NLP classification tasks and show an average improvement of 15.98% on worst-group accuracy, with trivial decrease in overall accuracy, over several zero-shot baselines. ... A simple theoretical model describing zero-shot failures along with a theoretical analysis of our approach ... Extensive experimental evidence on zero-shot language and multimodal models, showing improved worst-group accuracy of 15.98% across nine image and NLP datasets
Researcher Affiliation | Academia | Dyah Adila, Changho Shin, Linrong Cai, Frederic Sala, Department of Computer Science, University of Wisconsin-Madison, {adila,cshin23,lcai54,fredsala}@wisc.edu
Pseudocode | Yes | Algorithm 1: ROBOSHOT (a hedged sketch of the algorithm appears after this table)
Open Source Code | Yes | Code can be found in https://github.com/SprocketLab/roboshot
Open Datasets | Yes | We experimented on five binary and multi-class datasets with spurious correlations and distribution shifts: Waterbirds (Sagawa et al., 2019), CelebA (Liu et al., 2015), CXR14 (Wang et al., 2017), PACS (Li et al., 2017), and VLCS (Fang et al., 2013). ... We experimented on four text classification datasets: CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021), HateXplain (Mathew et al., 2021), Amazon-WILDS (Ni et al., 2019; Koh et al., 2021), and the Gender Bias classification dataset (Dinan et al., 2020; Miller et al., 2017).
Dataset Splits | Yes | Table 3 shows results from using only 100 random validation samples (LFA 100 val) and the full validation set (LFA). We use WILDS (Koh et al., 2021) default splits in Waterbirds and CelebA, and randomly shuffle 70:20:10 train:test:validation splits in PACS and VLCS. (A split sketch appears after this table.)
Hardware Specification | Yes | The training was conducted using two NVIDIA RTX A4000 GPUs.
Software Dependencies | No | All ROBOSHOT experiments are carried out using frozen weights and embeddings from huggingface (ALIGN, AltCLIP) and open-clip (CLIP ViT-B-32 and ViT-L-14, BiomedCLIP). ... We use SGD optimizer with fixed default momentum from PyTorch. (A loading sketch appears after this table.)
Experiment Setup | Yes | We report the hyperparameter choices in Appendix D.5. ... Table 12 shows the choices of hyperparameters we tune over for LFA experiments. We use SGD optimizer with fixed default momentum from PyTorch.
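
For readers who want the gist of Algorithm 1, below is a minimal sketch of the ROBOSHOT idea as the paper describes it: directions for harmful (spurious) and helpful (core) concepts are obtained by embedding LLM-generated insight strings, harmful directions are projected out of each input embedding, helpful ones are boosted, and zero-shot classification proceeds on the adjusted embedding. The function and variable names (reject, boost, roboshot_embedding) are our own illustration, not the authors' released code.

```python
import numpy as np

def reject(x, v):
    """Remove the component of x along direction v (vector rejection)."""
    v = v / np.linalg.norm(v)
    return x - np.dot(x, v) * v

def boost(x, v):
    """Amplify the component of x along direction v."""
    v = v / np.linalg.norm(v)
    return x + np.dot(x, v) * v

def roboshot_embedding(x, harmful_vecs, helpful_vecs):
    """Sketch of ROBOSHOT's embedding adjustment: project out harmful
    (spurious) insight directions, then boost helpful (core) ones,
    all in the shared multimodal embedding space."""
    for v in harmful_vecs:
        x = reject(x, v)
    for v in helpful_vecs:
        x = boost(x, v)
    return x / np.linalg.norm(x)

def zero_shot_predict(x, class_embeddings):
    """Standard zero-shot rule: pick the class embedding with the
    highest cosine similarity to the (adjusted) input embedding."""
    sims = [np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c))
            for c in class_embeddings]
    return int(np.argmax(sims))
```

In the paper, harmful_vecs and helpful_vecs are derived from embeddings of LLM-generated descriptions of spurious and core features (e.g., background vs. bird attributes for Waterbirds); no labeled data or fine-tuning is involved in the base method.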
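The 70:20:10 train:test:validation shuffle quoted for PACS and VLCS in the Dataset Splits row can be reproduced in a few lines; the seed and index logic here are illustrative, not the authors'.

```python
import numpy as np

def shuffle_split(n, seed=0):
    """Randomly shuffle n example indices into 70:20:10
    train/test/validation splits (seed is illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

train_idx, test_idx, val_idx = shuffle_split(1000)
```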
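The Software Dependencies and Experiment Setup rows reference frozen open-clip checkpoints and PyTorch's SGD with default momentum. Below is a minimal sketch of that setup. The pretrained tag "laion2b_s34b_b79k" is a commonly used open-clip checkpoint assumed here for illustration, and the Linear module is a stand-in for whatever LFA actually optimizes; the paper's exact checkpoints and hyperparameters are in its appendices.

```python
import torch
import open_clip

# Load a frozen CLIP ViT-B-32 from open-clip; weights are never updated.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # assumed tag, illustrative
)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

# The LFA experiments tune a small set of parameters on validation samples;
# this Linear layer is a placeholder, not the paper's parameterization.
adapter = torch.nn.Linear(512, 512)  # 512 = ViT-B-32 embedding width

# SGD with momentum left at PyTorch's default (0), per the quoted setup;
# the learning rate here is illustrative.
optimizer = torch.optim.SGD(adapter.parameters(), lr=1e-3)
```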