Improving Subgroup Robustness via Data Selection

Authors: Saachi Jain, Kimia Hamidieh, Kristian Georgiev, Andrew Ilyas, Marzyeh Ghassemi, Aleksander Madry

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50]. We provide more information about the datasets in Appendix B.1, and other experimental details in Appendix B.2. We first evaluate D3M and AUTO-D3M quantitatively, by measuring the worst-group accuracy of models trained on the selected subsets of the biased datasets above. (A code sketch of the worst-group accuracy metric appears after the table.)
Researcher Affiliation | Academia | MIT {saachij,hamidieh,krisgrg,ailyas,mghassem,madry}@mit.edu
Pseudocode | No | The paper describes the steps of its methods (D3M, AUTO-D3M, Estimating the coefficients τ(z)) in numbered lists within the text but does not present them in a formally labeled 'Algorithm' or 'Pseudocode' block or figure.
Open Source Code | Yes | Answer: [Yes] Justification: we disclose all hyperparameters and also release our code in the supplement.
Open Datasets | Yes | We consider four classification tasks where there is a spurious correlation between the target label and a group label in the training dataset: CelebA-Age [26, 19], CelebA-Blond [26], Waterbirds [41], and MultiNLI [50].
Dataset Splits | Yes | Given a training dataset S_train and a (small) validation dataset S_val, the goal of the group robustness problem is to produce a classifier f that minimizes the worst-case loss over groups. (The objective is written out below the table.)
Hardware Specification | Yes | Our model was trained on a machine with 8 A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'ResNet-18', and specific implementations like 'Group DRO implementation by Sagawa et al. [40]' and 'DFR implementation by Kirichenko et al. [21]', but does not provide specific version numbers for general software dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For the CelebA dataset, we train all methods with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-3, weight decay 1e-4, and batch size 512. We train all models for the CelebA-Age task for up to 5 epochs and all models for the CelebA-Blond task for up to 10 epochs. For the Waterbirds dataset, we train the approaches that use the ERM objective (including D3M) with learning rate 1e-4, weight decay 1e-4, and batch size 32. We train RWG, SUBG, Group DRO and JTT with learning rate 1e-5, weight decay 0.1, and batch size 32. We train all models for up to 20 epochs. (A config-style summary of these settings follows the table.)
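
The Research Type row quotes the paper's protocol of measuring worst-group accuracy. Below is a minimal sketch of that metric, assuming predictions, labels, and group identifiers are available as NumPy arrays; the function name and signature are illustrative and not taken from the released code.

```python
import numpy as np

def worst_group_accuracy(preds: np.ndarray, labels: np.ndarray, groups: np.ndarray) -> float:
    """Return the minimum per-group accuracy over all groups present in `groups`."""
    per_group = [
        float((preds[groups == g] == labels[groups == g]).mean())
        for g in np.unique(groups)
    ]
    return min(per_group)

# Example: group 0 is classified perfectly, group 1 only half right -> 0.5.
# worst_group_accuracy(np.array([1, 1, 0, 0]), np.array([1, 1, 1, 0]), np.array([0, 0, 1, 1]))
```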
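
For reference, the "worst-case loss over groups" quoted in the Dataset Splits row is the standard group-robustness objective. The notation below (group set G, per-group data distributions P_g, loss function ℓ) is assumed for illustration rather than copied from the paper.

```latex
% Standard worst-group objective (notation assumed, not taken verbatim from the paper):
% find the classifier f that minimizes the largest expected loss over any group g in G.
\[
  f^{\star} \;=\; \arg\min_{f} \; \max_{g \in \mathcal{G}} \;
  \mathbb{E}_{(x,\,y) \sim P_g}\!\left[ \ell\big(f(x),\, y\big) \right]
\]
```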
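
Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a config-style sketch. The dictionary layout and key names are assumptions made for readability and do not reflect the structure of the authors' released code.

```python
# Hyperparameters as quoted above; grouping and key names are illustrative only.
HPARAMS = {
    "celeba": {  # CelebA-Age: up to 5 epochs; CelebA-Blond: up to 10 epochs
        "all_methods": {"lr": 1e-3, "weight_decay": 1e-4, "batch_size": 512},
    },
    "waterbirds": {  # all models trained for up to 20 epochs
        "erm_objective_incl_d3m": {"lr": 1e-4, "weight_decay": 1e-4, "batch_size": 32},
        "rwg_subg_groupdro_jtt": {"lr": 1e-5, "weight_decay": 0.1, "batch_size": 32},
    },
}
```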