Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts

Authors: Amrith Setlur, Don Dennis, Benjamin Eysenbach, Aditi Raghunathan, Chelsea Finn, Virginia Smith, Sergey Levine

ICLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6 EXPERIMENTS: Our experiments aim to evaluate the performance of BR-DRO and compare it with ERM and group shift robustness methods that do not require group annotations for training examples. We conduct empirical analyses along the following axes: (i) worst group performance on datasets that exhibit known spurious correlations; (ii) robustness to random label noise in the training data; (iii) average performance on hybrid covariate shift datasets with unspecified groups; and (iv) accuracy in identifying minority groups. See Appendix B for additional experiments and details. ... Table 1 compares the average and worst group accuracy for BR-DRO with ERM and four group shift robustness baselines...
Researcher Affiliation	Academia	1 Carnegie Mellon University 2 Stanford University 3 UC Berkeley
Pseudocode	Yes	Thus, we provide an algorithm where both learner and adversary optimize BR-DRO iteratively through stochastic gradient ascent/descent (Algorithm 1 in Appendix A.1).
Open Source Code	Yes	The code used in our experiments can be found at https://github.com/ars22/bitrate_DRO.
Open Datasets	Yes	(i) Waterbirds (Wah et al., 2011) (background is spurious), CelebA (Liu et al., 2015) (binary gender is spuriously correlated with label blond); and CivilComments (WILDS) (Borkan et al., 2019) where the task is to predict toxic texts and there are 16 predefined groups Koh et al. (2021).
Dataset Splits	Yes	To tune hyperparameters, like prior work we assume access to some group annotations on validation set but also get decent performance (on some datasets) with only a balanced validation set (see Appendix B).
Hardware Specification	No	The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies	No	The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup	No	The paper mentions 'Implementation details' and states 'We provide model selection methodology and other details in Appendix B', but does not explicitly provide concrete hyperparameter values, learning rates, batch sizes, or number of epochs in the main text.