Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Zero-Shot Robustification of Zero-Shot Models

Authors: Dyah Adila, Changho Shin, Linrong Cai, Frederic Sala

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate ROBOSHOT on nine image and NLP classification tasks and show an average improvement of 15.98% on worst group accuracy, with trivial decrease in overall accuracy over several zero-shot baselines. ... A simple theoretical model describing zero-shot failures along with a theoretical analysis of our approach... Extensive experimental evidence on zero-shot language and multimodal models, showing improved worst-group accuracy of 15.98% across nine image and NLP datasets |
| Researcher Affiliation | Academia | Dyah Adila, Changho Shin, Linrong Cai, Frederic Sala, Department of Computer Science, University of Wisconsin-Madison, EMAIL |
| Pseudocode | Yes | Algorithm 1: ROBOSHOT |
| Open Source Code | Yes | Code can be found in https://github.com/SprocketLab/roboshot |
| Open Datasets | Yes | We experimented on five binary and multi-class datasets with spurious correlations and distribution shifts: Waterbirds (Sagawa et al., 2019), CelebA (Liu et al., 2015), CXR14 (Wang et al., 2017), PACS (Li et al., 2017), and VLCS (Fang et al., 2013). ... We experimented on four text classification datasets: CivilComments-WILDS (Borkan et al., 2019; Koh et al., 2021), HateXplain (Mathew et al., 2021), Amazon-WILDS (Ni et al., 2019; Koh et al., 2021) and Gender Bias classification dataset (Dinan et al., 2020; Miller et al., 2017). |
| Dataset Splits | Yes | Table 3 shows results from using only 100 random validation samples (LFA 100 val) and the full validation set (LFA). We use WILDS (Koh et al., 2021) default splits in Waterbirds and CelebA, and randomly shuffle 70:20:10 train:test:validation splits in PACS and VLCS. |
| Hardware Specification | Yes | The training was conducted using two NVIDIA RTX A4000 GPUs |
| Software Dependencies | No | All ROBOSHOT experiments are carried out using frozen weights and embeddings from huggingface (ALIGN, AltCLIP) and open-clip (CLIP ViT-B-32 and ViT-L-14, BiomedCLIP)... We use SGD optimizer with fixed default momentum from PyTorch. |
| Experiment Setup | Yes | We report the hyperparameter choices in Appendix D.5. ... Table 12 shows the choices of hyperparameters we tune over for LFA experiments. We use SGD optimizer with fixed default momentum from PyTorch. |