Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data
Authors: Esther Rolf, Theodora T Worledge, Benjamin Recht, Michael Jordan
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Empirical Results Having shown the importance of training set allocations from a theoretical perspective, we now provide a complementary empirical investigation of this phenomenon. See Appendix B for full details on each experimental setup. Figure 1 highlights the importance of at least a minimal representation of each group in order to achieve low population loss (black curves) for all objectives. |
| Researcher Affiliation | Academia | ¹Department of EECS, University of California, Berkeley; ²Department of Statistics, University of California, Berkeley. |
| Pseudocode | No | The paper describes its methods through mathematical formulations and narrative text, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to replicate the experiments is available at https://github.com/estherrolf/representation-matters. |
| Open Datasets | Yes | We use a wide range of datasets to give a full empirical characterization of the phenomena of interest (see Table 1). The CIFAR-4 dataset is comprised of bird, car, horse, and plane image instances from CIFAR-10 (Krizhevsky, 2009). The ISIC dataset contains images of skin lesions labelled as benign or malignant (Codella et al., 2019). The Goodreads dataset consists of written book reviews and numerical ratings (Wan & McAuley, 2018). The Mooc dataset contains student demographic and participation data (HarvardX, 2014). The Adult dataset consists of demographic data from the 1994 Census (Dua & Graff, 2017). |
| Dataset Splits | Yes | We pick models and parameters via a cross-validation procedure over a coarse grid of α; details are given in Appendix B.3. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) used in the experiments. |
| Experiment Setup | Yes | We pick models and parameters via a cross-validation procedure over a coarse grid of α; details are given in Appendix B.3. For the image classification tasks, we compare group-agnostic empirical risk minimization (ERM) to importance weighting (implemented via importance sampling (IS) batches following the findings of Buda et al. (2018)) and group distributionally robust optimization (GDRO) with group-dependent regularization as in Sagawa et al. (2020). |
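
The Experiment Setup row mentions training sets built under different group allocations α and importance weighting implemented via importance-sampled batches. The following is a minimal sketch of that general idea, not the authors' code: it assumes each example carries a group label, that `alpha` maps groups to training-set fractions, and that `gamma` maps groups to population fractions. All function names (`allocate_subsample`, `importance_weights`, `importance_sampled_batch`) are hypothetical and do not come from the linked repository.

```python
import random


def allocate_subsample(groups, alpha, n, seed=0):
    """Pick about n training indices so that a fraction alpha[g] comes from group g.

    Assumption: round(alpha[g] * n) examples are drawn per group without
    replacement, so the total can differ slightly from n due to rounding.
    """
    rng = random.Random(seed)
    by_group = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)
    chosen = []
    for g, frac in alpha.items():
        k = round(frac * n)
        # random.sample raises ValueError if group g has fewer than k examples.
        chosen.extend(rng.sample(by_group[g], k))
    return chosen


def importance_weights(alpha, gamma):
    """Per-group weights gamma[g] / alpha[g]; groups with alpha[g] == 0 are skipped."""
    return {g: gamma[g] / alpha[g] for g in alpha if alpha[g] > 0}


def importance_sampled_batch(indices, groups, weights, batch_size, seed=0):
    """Draw a batch with replacement, sampling each index in proportion to its group weight."""
    rng = random.Random(seed)
    per_example = [weights[groups[i]] for i in indices]
    return rng.choices(indices, weights=per_example, k=batch_size)


# Toy data: two equally sized groups in the population, but a skewed training allocation.
groups = ["A"] * 500 + ["B"] * 500
alpha = {"A": 0.1, "B": 0.9}   # training-set allocation (assumed)
gamma = {"A": 0.5, "B": 0.5}   # population proportions (assumed)

train_idx = allocate_subsample(groups, alpha, n=100)
weights = importance_weights(alpha, gamma)
batch = importance_sampled_batch(train_idx, groups, weights, batch_size=2000)
share_a = sum(groups[i] == "A" for i in batch) / len(batch)
```

With these weights, group A makes up 10% of the training subsample but is expected to make up about half of each sampled batch, matching its assumed population share. Drawing reweighted batches is one way to implement importance weighting; scaling the per-example loss by the same weights is another.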