Examining and Combating Spurious Features under Distribution Shift

Authors: Chunting Zhou, Xuezhe Ma, Paul Michel, Graham Neubig

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions."
Researcher Affiliation | Academia | "1) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; 2) Information Sciences Institute, University of Southern California, Los Angeles, USA."
Pseudocode | Yes | Algorithm 1: Online greedy algorithm for GC-DRO

    Input: α; β; m: #groups; n_i: #samples of group i
    Initialize historical average group losses L̂(0), historical estimate of group probabilities p̂_tr(0), historical average instance losses L̂_g(0), and uniform q(0)(x, y | g) for g ∈ {1, …, m}
    for t = 1, …, T do
        Sample a mini-batch (x, y, g) from P_train
        Perform online greedy updates for q(t) (Alg. 2)
        Update model parameters θ:
            d_i = [n_i q(t)(g_i) q(t)(x_i, y_i | g_i) / p̂_train(t)(g_i)] ∇_θ ℓ(x_i, y_i; θ(t−1))
            θ(t) = θ(t−1) − (η / |B|) Σ_{i=1..|B|} d_i
        if reached inner update criterion then
            Update q(t)(x, y | g):
            for g = 1, …, m do
                Sort instances in group g in decreasing order of ℓ(x, y; θ(t)); denote the sorted index π_g
                cutoff = ⌈(N − n_i) n_i β / (N − n_i β)⌉
                q(t)((x, y)_{π_g(j)} | g) = 1/β for 1 ≤ j ≤ cutoff
                q(t)((x, y)_{π_g(j)} | g) = n_i / N for j > cutoff
            end
        end
    end
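The inner instance-weight step of Algorithm 1 (sort each group by loss, upweight the hardest examples, downweight the rest) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the released repository: the function name, the data layout (a dict mapping group id to a loss array), and the exact cutoff expression are assumptions.

```python
import numpy as np

def update_instance_weights(losses_per_group, beta, N):
    """Sketch of the inner greedy q-update: within each group, instances are
    sorted by decreasing loss; the top `cutoff` instances receive weight
    1/beta and the remainder receive weight n_i/N (n_i = group size,
    N = total training size). Cutoff formula is an assumed reconstruction."""
    weights = {}
    for g, losses in losses_per_group.items():
        n_i = len(losses)
        order = np.argsort(-losses)  # indices in decreasing order of loss
        cutoff = int(np.ceil(beta * n_i * (N - n_i) / (N - beta * n_i)))
        q = np.full(n_i, n_i / N)    # low-loss instances: weight n_i / N
        q[order[:cutoff]] = 1.0 / beta  # high-loss instances: weight 1 / beta
        weights[g] = q
    return weights
```

With beta < 1 the hardest examples in each group are upweighted (1/beta > 1), while easy examples in a group are shrunk toward the group's share of the data.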
Open Source Code | Yes | "Our code is available at https://github.com/violet-zct/group-conditional-DRO."
Open Datasets | Yes | "We use the CelebA dataset (Liu et al., 2015), which has 162,770 training examples of celebrity faces. We use the MultiNLI dataset (Williams et al., 2018) and follow the train/dev/test split in Sagawa et al. (2020a), which results in 206,175 training examples. We perform experiments on the FDCL18 (Fortuna & Nunes, 2018) dataset, a corpus of 100k tweets annotated with four labels: Y = {hateful, spam, abusive, normal}."
Dataset Splits | Yes | "Models are selected based on the worst-performing group accuracy (of the clean partition) on the validation set."
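The selection criterion above (worst-group validation accuracy) is simple to compute. A minimal sketch, assuming per-example 0/1 correctness and group ids are available; the helper name is illustrative:

```python
from collections import defaultdict

def worst_group_accuracy(correct, groups):
    """Minimum per-group accuracy, the model-selection criterion quoted above.
    `correct` holds 0/1 prediction outcomes; `groups` gives each example's
    group id."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c, g in zip(correct, groups):
        hits[g] += c
        totals[g] += 1
    return min(hits[g] / totals[g] for g in totals)
```

Selecting on the worst group rather than average accuracy is what makes the checkpoint choice consistent with the robust (group-DRO) objective.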
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper.
Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment were provided in the paper.
Experiment Setup | Yes | "We select hyperparameters by the robust validation accuracy. For the clean partitions, we set α = 0.2, β = 0.5 for all three tasks. For the imperfect partitions, we set a relatively lower value of β to highlight poorly performing instances within groups. Specifically, for the NLP tasks we set α = 0.5, with β = 0.2 and 0.25 for NLI and toxicity detection respectively, and for the image task we set α = 0.2, β = 0.1."
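The reported settings can be collected into a single lookup table. The dictionary below is only a convenient restatement of the quoted hyperparameters; the key names are illustrative and do not come from the released code:

```python
# Reported (alpha, beta) settings per partition type and task.
# Key names ("clean", "imperfect", task names) are assumed labels.
GCDRO_HPARAMS = {
    "clean": {
        "celeba":   {"alpha": 0.2, "beta": 0.5},
        "mnli":     {"alpha": 0.2, "beta": 0.5},
        "toxicity": {"alpha": 0.2, "beta": 0.5},
    },
    "imperfect": {
        "mnli":     {"alpha": 0.5, "beta": 0.2},
        "toxicity": {"alpha": 0.5, "beta": 0.25},
        "celeba":   {"alpha": 0.2, "beta": 0.1},
    },
}
```

Note the pattern: imperfect partitions use a smaller β, which (per the algorithm's inner update) concentrates weight on fewer, harder instances within each group.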