Examining and Combating Spurious Features under Distribution Shift
Authors: Chunting Zhou, Xuezhe Ma, Paul Michel, Graham Neubig
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions. |
| Researcher Affiliation | Academia | ¹Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; ²Information Sciences Institute, University of Southern California, Los Angeles, USA. |
| Pseudocode | Yes | Algorithm 1: Online greedy algorithm for GC-DRO. Input: α; β; m: #groups; n_i: #samples of group i. Initialize the historical average group losses L̂(0), the historical estimate of group probabilities p̂_train(0), the historical average instance losses L̂_g(0), and q(0)(x, y\|g) = 𝟙ᵀ (all ones) for g ∈ {1, …, m}. For t = 1, …, T: sample a mini-batch (x, y, g) from P_train; perform online greedy updates for q(t) (Alg. 2); update the model parameters with d_i = (n_i q(t)(g_i) q(t)(x, y\|g_i) / p̂_train(t)(g_i)) · ℓ(x_i, y_i; θ(t−1)) and θ(t) = θ(t−1) − (η / \|B\|) Σ_{i=1}^{\|B\|} d_i. If the inner update criterion is reached, update q(t)(x, y\|g): for each g = 1, …, m, sort the instances in group g in decreasing order of ℓ(x, y; θ(t)), denoting the sorted index π_g; set cutoff = ⌈(N − n_i) n_i β / N⌉, then q(t)((x, y)_{π_g(j)}\|g) = 1/β for 1 ≤ j ≤ cutoff and q(t)((x, y)_{π_g(j)}\|g) = n_i/N for j > cutoff. (A Python sketch of this inner update follows the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/violet-zct/group-conditional-DRO. |
| Open Datasets | Yes | We use the CelebA dataset (Liu et al., 2015), which has 162,770 training examples of celebrity faces. We use the MultiNLI dataset (Williams et al., 2018) and follow the train/dev/test split in Sagawa et al. (2020a), which results in 206,175 training examples. We perform experiments on the FDCL18 (Fortuna & Nunes, 2018) dataset, a corpus of 100k tweets annotated with four labels: Y = {hateful, spam, abusive, normal}. |
| Dataset Splits | Yes | Models are selected based on the worst-performing group accuracy (under the clean partition) on the validation set. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment were provided in the paper. |
| Experiment Setup | Yes | We select hyperparameters by the robust validation accuracy. For the clean partitions, we set α = 0.2, β = 0.5 for all three tasks. For the imperfect partitions, we set a relatively lower value of β to highlight badly-performing instances within groups. Specifically, for the NLP tasks we set α = 0.5 with β = 0.2 for NLI and β = 0.25 for toxicity detection, and for the image task we set α = 0.2, β = 0.1. (These settings are collected in the config sketch below the table.) |
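
The inner greedy update quoted in the Pseudocode row is compact enough to sketch in code. Below is a minimal PyTorch illustration of the per-group weight assignment, assuming the weight rule as reconstructed from the garbled extraction above (1/β for the highest-loss instances, n_i/N for the rest); the function name `greedy_instance_weights` and its list-of-tensors interface are our own, not the authors' (their actual implementation is in the linked repository).

```python
import math
import torch

def greedy_instance_weights(group_losses, beta, N):
    """Sketch of the inner greedy update for q(x, y | g) in Algorithm 1.

    For each group g of size n_i, instances are sorted by decreasing
    loss; the top `cutoff` instances receive weight 1/beta and the
    remainder receive n_i / N (weight rule reconstructed above).
    """
    weights = []
    for losses in group_losses:  # one 1-D tensor of per-instance losses per group
        n_i = losses.numel()
        order = torch.argsort(losses, descending=True)   # sorted index pi_g
        cutoff = math.ceil((N - n_i) * n_i * beta / N)   # reconstructed cutoff
        q = torch.full((n_i,), n_i / N)                  # downweight the easy tail
        q[order[:cutoff]] = 1.0 / beta                   # upweight the hard head
        weights.append(q)
    return weights

# Toy usage: two groups of sizes 4 and 3 in a training set of N = 7.
losses = [torch.tensor([0.9, 0.1, 0.5, 0.3]), torch.tensor([2.0, 0.2, 0.4])]
qs = greedy_instance_weights(losses, beta=0.5, N=7)
```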
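
For quick reference, the (α, β) settings reported in the Experiment Setup row can be gathered into a single config. The dictionary layout and the task keys below are our own shorthand, not identifiers from the paper or its codebase.

```python
# Reported GC-DRO hyperparameters, arranged by partition type and task.
# Task keys ("celeba", "mnli", "fdcl18") are informal shorthands.
GC_DRO_HPARAMS = {
    "clean": {  # alpha = 0.2, beta = 0.5 for all three tasks
        "celeba": {"alpha": 0.2, "beta": 0.5},
        "mnli":   {"alpha": 0.2, "beta": 0.5},
        "fdcl18": {"alpha": 0.2, "beta": 0.5},
    },
    "imperfect": {
        "mnli":   {"alpha": 0.5, "beta": 0.2},   # NLI
        "fdcl18": {"alpha": 0.5, "beta": 0.25},  # toxicity detection
        "celeba": {"alpha": 0.2, "beta": 0.1},   # image task
    },
}
```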