Distributionally Robust Neural Networks

Authors: Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, Percy Liang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By coupling group DRO models with increased regularization (a stronger-than-typical ℓ2 penalty or early stopping), we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies.
Researcher Affiliation | Collaboration | Shiori Sagawa (Stanford University, ssagawa@cs.stanford.edu); Pang Wei Koh (Stanford University, pangwei@cs.stanford.edu); Tatsunori B. Hashimoto (Microsoft, tahashim@microsoft.com); Percy Liang (Stanford University, pliang@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1: Online optimization algorithm for group DRO (sketched after this table).
Open Source Code | Yes | Code for training group DRO models is available at https://github.com/kohpangwei/group_DRO.
Open Datasets | Yes | We study group DRO in the context of overparameterized neural networks in three applications (Figure 1): natural language inference with the MultiNLI dataset (Williams et al., 2018), facial attribute recognition with CelebA (Liu et al., 2015), and bird photograph recognition with our modified version of the CUB dataset (Wah et al., 2011).
Dataset Splits | Yes | We create our own validation and test sets by combining the training set and development set and then randomly shuffling them into a 50/20/30 train-val-test split.
Hardware Specification | No | The paper does not specify the exact hardware used for experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions software such as PyTorch, torchvision, the Hugging Face pytorch-transformers library, and the AdamW optimizer, but it does not give version numbers for these components, which are needed for exact reproducibility.
Experiment Setup | Yes | We train the ResNet50 models using stochastic gradient descent with a momentum term of 0.9 and a batch size of 128;...We use a fixed learning rate... For the standard training experiments in Section 3.1, we use an ℓ2 penalty of λ = 0.0001... with a learning rate of 0.001 for Waterbirds and 0.0001 for CelebA. We train the CelebA models for 50 epochs and the Waterbirds models for 300 epochs.
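
The Pseudocode and Experiment Setup rows above quote the paper's online optimization algorithm for group DRO (Algorithm 1) and its SGD settings. The following is a minimal PyTorch sketch of one such update, not the authors' implementation: the function names (`group_avg_losses`, `group_dro_step`), the `eta_q` step size, and the loop structure are illustrative assumptions; the criterion is assumed to be built with `reduction='none'` so per-example losses are available.

```python
import torch

def group_avg_losses(per_example_loss, group_idx, n_groups):
    """Average the per-example losses within each group present in the batch."""
    sums = torch.zeros(n_groups, device=per_example_loss.device)
    counts = torch.zeros(n_groups, device=per_example_loss.device)
    sums.index_add_(0, group_idx, per_example_loss)
    counts.index_add_(0, group_idx, torch.ones_like(per_example_loss))
    return sums / counts.clamp(min=1.0)

def group_dro_step(model, optimizer, criterion, x, y, group_idx, q, eta_q, n_groups):
    """One online group DRO update: exponentiated-gradient ascent on the group
    weights q, then a gradient step on the q-weighted (worst-case) loss."""
    per_example_loss = criterion(model(x), y)        # criterion uses reduction='none'
    losses = group_avg_losses(per_example_loss, group_idx, n_groups)
    q = q * torch.exp(eta_q * losses.detach())       # up-weight high-loss groups
    q = q / q.sum()                                  # renormalize onto the simplex
    robust_loss = torch.dot(q, losses)               # weighted loss the model minimizes
    optimizer.zero_grad()
    robust_loss.backward()
    optimizer.step()
    return q

# Illustrative usage with placeholders for the model and data loader. The
# optimizer settings follow the quoted Experiment Setup row (SGD, momentum 0.9,
# batches of size 128); weight_decay is shown at the quoted standard-training
# value λ = 1e-4, while the paper's DRO runs use stronger regularization or
# early stopping. The eta_q value below is an assumption, not from the paper.
#   q = torch.ones(n_groups) / n_groups
#   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9,
#                               weight_decay=1e-4)
#   for x, y, group_idx in loader:
#       q = group_dro_step(model, optimizer, criterion, x, y, group_idx,
#                          q, eta_q=0.01, n_groups=n_groups)
```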
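
For the Dataset Splits row, a minimal sketch of the described re-split is given below: the original training and development sets are pooled, shuffled, and cut 50/20/30 into train, validation, and test. The `examples` list, `resplit` name, and seed are illustrative assumptions; the paper does not state its shuffling seed.

```python
import random

def resplit(examples, seed=0):
    """Pool the original train and dev examples, shuffle, and cut 50/20/30."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.5 * n), int(0.2 * n)
    return (shuffled[:n_train],                      # 50% train
            shuffled[n_train:n_train + n_val],       # 20% validation
            shuffled[n_train + n_val:])              # 30% test
```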