Improving Subgroup Robustness via Data Selection
Authors: Saachi Jain, Kimia Hamidieh, Kristian Georgiev, Andrew Ilyas, Marzyeh Ghassemi, Aleksander Madry
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50]. We provide more information about the datasets in Appendix B.1, and other experimental details in Appendix B.2. We first evaluate D3M and AUTO-D3M quantitatively, by measuring the worst-group accuracy of models trained on the selected subsets of the biased datasets above. |
| Researcher Affiliation | Academia | MIT {saachij,hamidieh,krisgrg,ailyas,mghassem,madry}@mit.edu |
| Pseudocode | No | The paper describes the steps of its methods (D3M, AUTO-D3M, Estimating the coefficients τ(z)) in numbered lists within the text but does not present them in a formally labeled 'Algorithm' or 'Pseudocode' block or figure. |
| Open Source Code | Yes | Answer: [Yes] Justification: we disclose all hyperparameters and also release our code in the supplement. |
| Open Datasets | Yes | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50]. |
| Dataset Splits | Yes | Given a training dataset S_train and a (small) validation dataset S_val, the goal of the group robustness problem is to produce a classifier f that minimizes the worst-case loss over groups (this objective is written out below the table). |
| Hardware Specification | Yes | Our model was trained on a machine with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'ResNet-18', and specific implementations like 'Group DRO implementation by Sagawa et al. [40]' and 'DFR implementation by Kirichenko et al. [21]', but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For the CelebA dataset, we train all methods with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train all models for the CelebA-Age task for up to 5 epochs and all models for the CelebA-Blond task for up to 10 epochs. For the Waterbirds dataset, we train the approaches that use the ERM objective (including D3M) with learning rate 1e-4, weight decay 1e-4, and batch size 32. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-5, weight decay 0.1, and batch size 32. We train all models for up to 20 epochs. (A configuration sketch of these values appears below the table.) |
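
The Dataset Splits row quotes the paper's worst-group objective in prose. Written out, it is the standard worst-group risk minimization; the notation below (loss ℓ, group set 𝒢, per-group distribution P_g) is our rendering for reference, not a verbatim formula from the paper.

```latex
% Worst-group risk minimization: choose the classifier f that minimizes the
% largest expected loss over the groups g, given S_train and a small S_val.
% Notation (\ell, \mathcal{G}, P_g) is our rendering, not copied from the paper.
\[
  \min_{f} \; \max_{g \in \mathcal{G}} \;
  \mathbb{E}_{(x,\, y) \sim P_g}\!\left[\, \ell\big(f(x),\, y\big) \right]
\]
```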
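
The hyperparameters quoted in the Experiment Setup row can be collected into a single lookup. The sketch below is a minimal, hypothetical rendering of those values in Python; the dictionary keys, the `erm_like`/`robust` grouping, and the `get_config` helper are our own naming and do not come from the authors' released code.

```python
# Hypothetical hyperparameter summary assembled from the quoted Experiment Setup row.
# Key names and this helper are illustrative; they are not part of the released code.
CONFIGS = {
    ("celeba-age", "erm_like"):   {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512, "max_epochs": 5},
    ("celeba-age", "robust"):     {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512, "max_epochs": 5},
    ("celeba-blond", "erm_like"): {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512, "max_epochs": 10},
    ("celeba-blond", "robust"):   {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512, "max_epochs": 10},
    ("waterbirds", "erm_like"):   {"lr": 1e-4, "weight_decay": 1e-4, "batch_size": 32,  "max_epochs": 20},
    ("waterbirds", "robust"):     {"lr": 1e-5, "weight_decay": 0.1,  "batch_size": 32,  "max_epochs": 20},
}

def get_config(dataset: str, method_family: str) -> dict:
    """Return the quoted training hyperparameters for a dataset / method family.

    `erm_like` covers methods trained with the ERM objective (including D3M);
    `robust` covers RWG, SUBG, Group DRO, and JTT, per the quoted setup.
    """
    return CONFIGS[(dataset, method_family)]

# Example: hyperparameters for ERM-style training on Waterbirds.
print(get_config("waterbirds", "erm_like"))
```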