Distributionally Robust Neural Networks

Authors: Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, Percy Liang

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By coupling group DRO models with increased regularization (a stronger-than-typical ℓ2 penalty or early stopping), we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies.
Researcher Affiliation | Collaboration | Shiori Sagawa (Stanford University, ssagawa@cs.stanford.edu); Pang Wei Koh (Stanford University, pangwei@cs.stanford.edu); Tatsunori B. Hashimoto (Microsoft, tahashim@microsoft.com); Percy Liang (Stanford University, pliang@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1: Online optimization algorithm for group DRO (sketched after this table).
Open Source Code | Yes | Code for training group DRO models is available at https://github.com/kohpangwei/group_DRO.
Open Datasets | Yes | We study group DRO in the context of overparameterized neural networks in three applications (Figure 1): natural language inference with the MultiNLI dataset (Williams et al., 2018), facial attribute recognition with CelebA (Liu et al., 2015), and bird photograph recognition with our modified version of the CUB dataset (Wah et al., 2011).
Dataset Splits | Yes | We create our own validation and test sets by combining the training set and development set and then randomly shuffling them into a 50/20/30 train-val-test split.
Hardware Specification | No | The paper does not specify the exact hardware used for experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions software such as PyTorch, torchvision, the Hugging Face pytorch-transformers library, and the AdamW optimizer, but it does not give version numbers for these components, which are needed for exact reproducibility.
Experiment Setup | Yes | We train the ResNet50 models using stochastic gradient descent with a momentum term of 0.9 and a batch size of 128;...We use a fixed learning rate... For the standard training experiments in Section 3.1, we use an ℓ2 penalty of λ = 0.0001... with a learning rate of 0.001 for Waterbirds and 0.0001 for CelebA. We train the CelebA models for 50 epochs and the Waterbirds models for 300 epochs.
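
The Pseudocode and Experiment Setup rows above quote the paper's online optimization algorithm for group DRO (Algorithm 1) and its SGD settings. The following is a minimal PyTorch sketch of one such update, not the authors' implementation: the function names (`group_avg_losses`, `group_dro_step`), the `eta_q` step size, and the loop structure are illustrative assumptions; the criterion is assumed to be built with `reduction='none'` so per-example losses are available.

```python
import torch

def group_avg_losses(per_example_loss, group_idx, n_groups):
    """Average the per-example losses within each group present in the batch."""
    sums = torch.zeros(n_groups, device=per_example_loss.device)
    counts = torch.zeros(n_groups, device=per_example_loss.device)
    sums.index_add_(0, group_idx, per_example_loss)
    counts.index_add_(0, group_idx, torch.ones_like(per_example_loss))
    return sums / counts.clamp(min=1.0)

def group_dro_step(model, optimizer, criterion, x, y, group_idx, q, eta_q, n_groups):
    """One online group DRO update: exponentiated-gradient ascent on the group
    weights q, then a gradient step on the q-weighted (worst-case) loss."""
    per_example_loss = criterion(model(x), y)        # criterion uses reduction='none'
    losses = group_avg_losses(per_example_loss, group_idx, n_groups)
    q = q * torch.exp(eta_q * losses.detach())       # up-weight high-loss groups
    q = q / q.sum()                                  # renormalize onto the simplex
    robust_loss = torch.dot(q, losses)               # weighted loss the model minimizes
    optimizer.zero_grad()
    robust_loss.backward()
    optimizer.step()
    return q

# Illustrative usage with placeholders for the model and data loader. The
# optimizer settings follow the quoted Experiment Setup row (SGD, momentum 0.9,
# batches of size 128); weight_decay is shown at the quoted standard-training
# value λ = 1e-4, while the paper's DRO runs use stronger regularization or
# early stopping. The eta_q value below is an assumption, not from the paper.
#   q = torch.ones(n_groups) / n_groups
#   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9,
#                               weight_decay=1e-4)
#   for x, y, group_idx in loader:
#       q = group_dro_step(model, optimizer, criterion, x, y, group_idx,
#                          q, eta_q=0.01, n_groups=n_groups)
```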
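
For the Dataset Splits row, a minimal sketch of the described re-split is given below: the original training and development sets are pooled, shuffled, and cut 50/20/30 into train, validation, and test. The `examples` list, `resplit` name, and seed are illustrative assumptions; the paper does not state its shuffling seed.

```python
import random

def resplit(examples, seed=0):
    """Pool the original train and dev examples, shuffle, and cut 50/20/30."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.5 * n), int(0.2 * n)
    return (shuffled[:n_train],                      # 50% train
            shuffled[n_train:n_train + n_val],       # 20% validation
            shuffled[n_train + n_val:])              # 30% test
```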