Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Examining and Combating Spurious Features under Distribution Shift
Authors: Chunting Zhou, Xuezhe Ma, Paul Michel, Graham Neubig
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions. |
| Researcher Affiliation | Academia | ¹Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; ²Information Sciences Institute, University of Southern California, Los Angeles, USA. |
| Pseudocode | Yes | Algorithm 1: Online greedy algorithm for GC-DRO. Input: $\alpha$; $\beta$; $m$: #groups; $n_i$: #samples of group $i$. Initialize historical average group losses $\hat{L}^{(0)}$, historical estimate of group probabilities $\hat{p}_{\text{tr}}^{(0)}$, historical average instance losses $\hat{L}_g^{(0)}$, and $q^{(0)}(x, y \mid g)$ = 1T for $g \in \{1, \dots, m\}$. for $t = 1, \dots, T$ do: sample a mini-batch $(x, y, g)$ from $P_{\text{train}}$; perform online greedy updates for $q^{(t)}$ (Alg. 2); update model parameters $\theta$: $d_i = \frac{n_i \, q^{(t)}(g_i) \, q^{(t)}(x, y \mid g_i)}{\hat{p}_{\text{train}}^{(t)}(g_i)} \nabla \ell(x_i, y_i; \theta^{(t-1)})$, $\theta^{(t)} = \theta^{(t-1)} - \frac{\eta}{\lvert B \rvert} \sum_{i=1}^{\lvert B \rvert} d_i$; if reached inner update criterion then: update $q^{(t)}(x, y \mid g)$: for $g = 1, \dots, m$ do: sort instances in group $g$ in decreasing order of $\ell(x, y; \theta_t)$ and denote the sorted index $\pi_g$; cutoff = l (N ni)niβ; $q^{(t)}((x, y)_{\pi_g(j)} \mid g) = \frac{1}{\beta}$ for $1 \le j \le \text{cutoff}$; $q^{(t)}((x, y)_{\pi_g(j)} \mid g) = \frac{n_i}{N}$ for $j > \text{cutoff}$; end for; end if; end for. |
| Open Source Code | Yes | Our code is available at https://github.com/violet-zct/group-conditional-DRO. |
| Open Datasets | Yes | We use the CelebA dataset (Liu et al., 2015) which has 162,770 training examples of celebrity faces. We use the MultiNLI dataset (Williams et al., 2018) and follow the train/dev/test split in Sagawa et al. (2020a), which results in 206,175 training examples. We perform experiments on the FDCL18 (Fortuna & Nunes, 2018) dataset, a corpus of 100k tweets annotated with four labels: Y = {hateful, spam, abusive and normal}. |
| Dataset Splits | Yes | Models are selected based on the worst-performing accuracy of group (of the clean partition) in the validation set. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided in the paper. |
| Software Dependencies | No | No specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment were provided in the paper. |
| Experiment Setup | Yes | We select hyperparameters by the robust validation accuracy. For the clean partitions, we set α = 0.2, β = 0.5 for all the three tasks. For the imperfect partitions, we set a relatively lower value of β to highlight badly performed instances within groups. Specifically, for NLP tasks we set α = 0.5, β = 0.2 and 0.25 for NLI and toxicity detection respectively, and for the image task, we set α = 0.2, β = 0.1. |
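
The Dataset Splits and Experiment Setup rows describe selecting models by the worst-performing group's validation accuracy, with fixed α and β per task and partition type. The following is a minimal Python sketch of that selection rule, not the authors' implementation: the names `GC_DRO_HPARAMS`, `worst_group_accuracy`, and `select_checkpoint`, the dictionary keys, and the checkpoint names in the usage note are all hypothetical. Only the numeric α and β values come from the quoted text.

```python
from collections import defaultdict

# Alpha/beta values quoted in the "Experiment Setup" row.
# The keys ("clean", "imperfect", "image", "nli", "toxicity") are
# illustrative labels chosen for this sketch, not names from the paper.
GC_DRO_HPARAMS = {
    "clean": {
        "image": {"alpha": 0.2, "beta": 0.5},
        "nli": {"alpha": 0.2, "beta": 0.5},
        "toxicity": {"alpha": 0.2, "beta": 0.5},
    },
    "imperfect": {
        "image": {"alpha": 0.2, "beta": 0.1},
        "nli": {"alpha": 0.5, "beta": 0.2},
        "toxicity": {"alpha": 0.5, "beta": 0.25},
    },
}


def worst_group_accuracy(predictions, labels, groups):
    """Return the lowest per-group accuracy over the groups that appear."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    if not total:
        raise ValueError("no examples provided")
    return min(correct[g] / total[g] for g in total)


def select_checkpoint(val_results):
    """Pick the checkpoint with the highest worst-group validation accuracy.

    val_results maps a checkpoint name to a (predictions, labels, groups)
    tuple computed on the validation set (an assumed data layout).
    """
    return max(
        val_results,
        key=lambda name: worst_group_accuracy(*val_results[name]),
    )
```

For example, given two hypothetical checkpoints, `select_checkpoint({"epoch_1": (preds_1, labels, groups), "epoch_2": (preds_2, labels, groups)})` returns the name whose weakest group scores highest, which differs from selecting on average accuracy whenever one group lags behind the rest.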