Examining and Combating Spurious Features under Distribution Shift

Authors: Chunting Zhou, Xuezhe Ma, Paul Michel, Graham Neubig

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions."
Researcher Affiliation | Academia | "1) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; 2) Information Sciences Institute, University of Southern California, Los Angeles, USA."
Pseudocode | Yes | Algorithm 1: Online greedy algorithm for GC-DRO

    Input: α; β; m: #groups; n_i: #samples of group i
    Initialize historical average group losses L̂(0), historical estimate of group probabilities p̂_tr(0), historical average instance losses L̂_g(0), and uniform q(0)(x, y | g) for g ∈ {1, …, m}
    for t = 1, …, T do
        Sample a mini-batch (x, y, g) from P_train
        Perform online greedy updates for q(t) (Alg. 2)
        Update model parameters θ:
            d_i = [n_i q(t)(g_i) q(t)(x_i, y_i | g_i) / p̂_train(t)(g_i)] ∇_θ ℓ(x_i, y_i; θ(t−1))
            θ(t) = θ(t−1) − (η / |B|) Σ_{i=1..|B|} d_i
        if reached inner update criterion then
            Update q(t)(x, y | g):
            for g = 1, …, m do
                Sort instances in group g in decreasing order of ℓ(x, y; θ(t)); denote the sorted index π_g
                cutoff = ⌈(N − n_i) n_i β / (N − n_i β)⌉
                q(t)((x, y)_{π_g(j)} | g) = 1/β for 1 ≤ j ≤ cutoff
                q(t)((x, y)_{π_g(j)} | g) = n_i / N for j > cutoff
            end
        end
    end
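The inner instance-weight step of Algorithm 1 (sort each group by loss, upweight the hardest examples, downweight the rest) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the released repository: the function name, the data layout (a dict mapping group id to a loss array), and the exact cutoff expression are assumptions.

```python
import numpy as np

def update_instance_weights(losses_per_group, beta, N):
    """Sketch of the inner greedy q-update: within each group, instances are
    sorted by decreasing loss; the top `cutoff` instances receive weight
    1/beta and the remainder receive weight n_i/N (n_i = group size,
    N = total training size). Cutoff formula is an assumed reconstruction."""
    weights = {}
    for g, losses in losses_per_group.items():
        n_i = len(losses)
        order = np.argsort(-losses)  # indices in decreasing order of loss
        cutoff = int(np.ceil(beta * n_i * (N - n_i) / (N - beta * n_i)))
        q = np.full(n_i, n_i / N)    # low-loss instances: weight n_i / N
        q[order[:cutoff]] = 1.0 / beta  # high-loss instances: weight 1 / beta
        weights[g] = q
    return weights
```

With beta < 1 the hardest examples in each group are upweighted (1/beta > 1), while easy examples in a group are shrunk toward the group's share of the data.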
Open Source Code | Yes | "Our code is available at https://github.com/violet-zct/group-conditional-DRO."
Open Datasets | Yes | "We use the CelebA dataset (Liu et al., 2015), which has 162,770 training examples of celebrity faces. We use the MultiNLI dataset (Williams et al., 2018) and follow the train/dev/test split in Sagawa et al. (2020a), which results in 206,175 training examples. We perform experiments on the FDCL18 (Fortuna & Nunes, 2018) dataset, a corpus of 100k tweets annotated with four labels: Y = {hateful, spam, abusive, normal}."
Dataset Splits | Yes | "Models are selected based on the worst-performing group accuracy (of the clean partition) on the validation set."
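The selection criterion above (worst-group validation accuracy) is simple to compute. A minimal sketch, assuming per-example 0/1 correctness and group ids are available; the helper name is illustrative:

```python
from collections import defaultdict

def worst_group_accuracy(correct, groups):
    """Minimum per-group accuracy, the model-selection criterion quoted above.
    `correct` holds 0/1 prediction outcomes; `groups` gives each example's
    group id."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c, g in zip(correct, groups):
        hits[g] += c
        totals[g] += 1
    return min(hits[g] / totals[g] for g in totals)
```

Selecting on the worst group rather than average accuracy is what makes the checkpoint choice consistent with the robust (group-DRO) objective.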
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper.
Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment were provided in the paper.
Experiment Setup | Yes | "We select hyperparameters by the robust validation accuracy. For the clean partitions, we set α = 0.2, β = 0.5 for all three tasks. For the imperfect partitions, we set a relatively lower value of β to highlight poorly performing instances within groups. Specifically, for the NLP tasks we set α = 0.5, with β = 0.2 and 0.25 for NLI and toxicity detection respectively, and for the image task we set α = 0.2, β = 0.1."
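The reported settings can be collected into a single lookup table. The dictionary below is only a convenient restatement of the quoted hyperparameters; the key names are illustrative and do not come from the released code:

```python
# Reported (alpha, beta) settings per partition type and task.
# Key names ("clean", "imperfect", task names) are assumed labels.
GCDRO_HPARAMS = {
    "clean": {
        "celeba":   {"alpha": 0.2, "beta": 0.5},
        "mnli":     {"alpha": 0.2, "beta": 0.5},
        "toxicity": {"alpha": 0.2, "beta": 0.5},
    },
    "imperfect": {
        "mnli":     {"alpha": 0.5, "beta": 0.2},
        "toxicity": {"alpha": 0.5, "beta": 0.25},
        "celeba":   {"alpha": 0.2, "beta": 0.1},
    },
}
```

Note the pattern: imperfect partitions use a smaller β, which (per the algorithm's inner update) concentrates weight on fewer, harder instances within each group.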