Improving Subgroup Robustness via Data Selection

Authors: Saachi Jain, Kimia Hamidieh, Kristian Georgiev, Andrew Ilyas, Marzyeh Ghassemi, Aleksander Madry

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50]. We provide more information about the datasets in Appendix B.1, and other experimental details in Appendix B.2. We first evaluate D3M and AUTO-D3M quantitatively, by measuring the worst-group accuracy of models trained on the selected subsets of the biased datasets above. (A code sketch of the worst-group accuracy metric appears after the table.)
Researcher Affiliation | Academia | MIT {saachij,hamidieh,krisgrg,ailyas,mghassem,madry}@mit.edu
Pseudocode | No | The paper describes the steps of its methods (D3M, AUTO-D3M, Estimating the coefficients τ(z)) in numbered lists within the text but does not present them in a formally labeled 'Algorithm' or 'Pseudocode' block or figure.
Open Source Code | Yes | Answer: [Yes] Justification: we disclose all hyperparameters and also release our code in the supplement.
Open Datasets | Yes | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50].
Dataset Splits | Yes | Given a training dataset S_train and a (small) validation dataset S_val, the goal of the group robustness problem is to produce a classifier f that minimizes the worst-case loss over groups. (The objective is written out below the table.)
Hardware Specification | Yes | Our model was trained on a machine with 8 A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'ResNet-18', and specific implementations like 'Group DRO implementation by Sagawa et al. [40]' and 'DFR implementation by Kirichenko et al. [21]', but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For the CelebA dataset, we train all methods with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train all models for the CelebA-Age task for up to 5 epochs and all models for the CelebA-Blond task for up to 10 epochs. For the Waterbirds dataset, we train the approaches that use the ERM objective (including D3M) with learning rate 1e-4, weight decay 1e-4, and batch size 32. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-5, weight decay 0.1, and batch size 32. We train all models for up to 20 epochs. (A config-style summary of these settings follows the table.)
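
The Research Type row quotes the paper's protocol of measuring worst-group accuracy. Below is a minimal sketch of that metric, assuming predictions, labels, and group identifiers are available as NumPy arrays; the function name and signature are illustrative and not taken from the released code.

```python
import numpy as np

def worst_group_accuracy(preds: np.ndarray, labels: np.ndarray, groups: np.ndarray) -> float:
    """Return the minimum per-group accuracy over all groups present in `groups`."""
    per_group = [
        float((preds[groups == g] == labels[groups == g]).mean())
        for g in np.unique(groups)
    ]
    return min(per_group)

# Example: group 0 is classified perfectly, group 1 only half right -> 0.5.
# worst_group_accuracy(np.array([1, 1, 0, 0]), np.array([1, 1, 1, 0]), np.array([0, 0, 1, 1]))
```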
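
For reference, the "worst-case loss over groups" quoted in the Dataset Splits row is the standard group-robustness objective. The notation below (group set G, per-group data distributions P_g, loss function ℓ) is assumed for illustration rather than copied from the paper.

```latex
% Standard worst-group objective (notation assumed, not taken verbatim from the paper):
% find the classifier f that minimizes the largest expected loss over any group g in G.
\[
  f^{\star} \;=\; \arg\min_{f} \; \max_{g \in \mathcal{G}} \;
  \mathbb{E}_{(x,\,y) \sim P_g}\!\left[ \ell\big(f(x),\, y\big) \right]
\]
```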
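
Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a config-style sketch. The dictionary layout and key names are assumptions made for readability and do not reflect the structure of the authors' released code.

```python
# Hyperparameters as quoted above; grouping and key names are illustrative only.
HPARAMS = {
    "celeba": {  # CelebA-Age: up to 5 epochs; CelebA-Blond: up to 10 epochs
        "all_methods": {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512},
    },
    "waterbirds": {  # all models trained for up to 20 epochs
        "erm_objective_incl_d3m": {"lr": 1e-4, "weight_decay": 1e-4, "batch_size": 32},
        "rwg_subg_groupdro_jtt": {"lr": 1e-5, "weight_decay": 0.1, "batch_size": 32},
    },
}
```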