Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness
Authors: Stephen Pfohl, Natalie Harris, Chirag Nagpal, David Madras, Vishwali Mhasawade, Olawale Salaudeen, Awa Dieng, Shannon Sequeira, Santiago Arciniegas, Lillian Sung, Nnamdi Ezeanochie, Heather Cole-Lewis, Katherine A. Heller, Sanmi Koyejo, Alexander D'Amour
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with synthetic and real-world data to empirically verify the properties suggested by our theoretical analysis. ... We conduct a simulation study and experiment with real-world tabular data. The purpose of these experiments is to empirically verify the properties discussed in Section 3. |
| Researcher Affiliation | Collaboration | Stephen R. Pfohl, Google Research; Natalie Harris, Google Research; Chirag Nagpal, Google Research; David Madras, Google DeepMind; Vishwali Mhasawade, New York University; Olawale Salaudeen, Massachusetts Institute of Technology; Awa Dieng, Google DeepMind; Shannon Sequeira, Google Research; Santiago Arciniegas, The Hospital for Sick Children; Lillian Sung, The Hospital for Sick Children; Nnamdi Ezeanochie, Google; Heather Cole-Lewis, Google; Katherine Heller, Google Research; Sanmi Koyejo, Stanford University; Alexander D'Amour, Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Calculation of the weighted metric difference Ta with oracle access to P(A = a \| V) ... Algorithm 2: Calculation of the weighted metric difference Ta with cross-fitting. |
| Open Source Code | Yes | Code to reproduce the experiments is available at https://github.com/google-research/google-research/tree/master/causal_evaluation. |
| Open Datasets | Yes | For the real-world data experiments, we follow Ding et al. [50] to derive prediction tasks from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) provided by the U.S. Census Bureau [51]. |
| Dataset Splits | Yes | For each data generating process, we sample 70,000 independent samples and use 50,000 for training and 20,000 as a held-out testing dataset for evaluation. ...with stratified five-fold cross-validation |
| Hardware Specification | Yes | The simulation study was conducted on machines with 32 CPUs and 32 GB of RAM. ...As in the case of the simulation study, we conduct these experiments using machines with 32 CPUs and 32 GB of RAM. |
| Software Dependencies | Yes | For model fitting, we use the scikit-learn version [60] 1.6.1 implementation of gradient boosting classification trees... We compute confidence intervals for each bin separately using the Wilson Score Interval Method [61] with the implementation provided by the Statsmodels package version 0.12.1 [62]. |
| Experiment Setup | Yes | All model fitting and evaluation procedures are repeated and conducted separately for cases where prediction of Y is conducted with (1) X alone, (2) X and an additional categorical covariate indicating subgroup membership A, and (3) a set of models using X alone fit separately for each subgroup. For model fitting, we use the scikit-learn version [60] 1.6.1 implementation of gradient boosting classification trees (specifically, HistGradientBoostingClassifier) with stratified five-fold cross-validation, with a hyperparameter grid over the maximum number of leaf nodes in {10, 25, 50}, refitting the model over the training data using the hyperparameter setting with the minimum average log-loss over the held-out cross-validation folds. |