Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistical Inference for Fairness Auditing
Authors: John J. Cherian, Emmanuel J. Candès
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees. |
| Researcher Affiliation | Academia | John J. Cherian EMAIL Department of Statistics Stanford University Stanford, CA 94305, USA Emmanuel J. Candès EMAIL Departments of Mathematics and Statistics Stanford University Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1: Bootstrapping the lower confidence bound critical value; Algorithm 2: Bootstrapping the (rescaled) lower confidence bound critical value; Algorithm 3: Bootstrapping the Boolean certificate critical value; Algorithm 4: Constructing p-values for G ∈ G; Algorithm 5: Bootstrapping the RKHS confidence set critical value |
| Open Source Code | Yes | A Python package, fairaudit, implementing these methods is available to install from PyPI and can be downloaded at github.com/jjcherian/fairaudit. |
| Open Datasets | Yes | We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees. Following Angwin et al. (2016), we apply our auditing method to a data set obtained by ProPublica in 2016 that includes COMPAS risk scores... for n = 6781 individuals. We evaluate the flagging methodology on an income prediction dataset derived from the 2018 Census American Community Survey Public Use Microdata and made available in the Folktables package (Ding et al., 2021; Flood, 2015). |
| Dataset Splits | Yes | Using held-out data sets of varying size, we issue Boolean certificates for sub-intervals G ⊆ [0, 1] over which ϵ(G) < 1. ...we fit logistic and linear regression models to a training set of 1000 data points, and then sample holdout sets of varying size from the remaining data. |
| Hardware Specification | Yes | Running a certification audit at the largest sample size considered (n = 1600) takes under 7 seconds on a 2020 MacBook Pro. Each audit takes under one second on a 2020 MacBook Pro. This flagging audit takes under 0.5 seconds on a 2020 MacBook Pro. Each audit takes approximately 1.5 seconds on a 2020 MacBook Pro. |
| Software Dependencies | No | A Python package, fairaudit, implementing these methods is available to install from PyPI and can be downloaded at github.com/jjcherian/fairaudit. |
| Experiment Setup | Yes | We set the nominal error rate to 0.1 and vary the sample size n over (100, 200, 400, 800, 1600). We evaluate our method using 500 bootstrap samples. We sample β0 ∼ N(0, 1) i.i.d. and use 1000 training points, (Xi, Yi), to fit the OLS predictor, f(x) = β̂⊤x. We then audit over covariate shifts corresponding to the nonnegative functions belonging to the unit ball of a Gaussian RKHS with varying bandwidths σ ∈ {0.1, 0.5, 1}. |
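As a rough illustration of the experiment setup quoted above (sample β0 i.i.d. N(0, 1), fit an OLS predictor on 1000 training points, audit with 500 bootstrap samples at nominal error rate 0.1), the following sketch shows the general shape of such a bootstrap audit. This is not the fairaudit API or the paper's exact algorithm: the percentile lower confidence bound on a group's mean loss is a simplified stand-in for the paper's Algorithms 1-3, and all dimensions and variable names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper varies the audit size over (100, ..., 1600).
d = 5
n_train, n_audit = 1000, 800

# Sample beta0 i.i.d. N(0, 1) and fit the OLS predictor f(x) = beta_hat^T x
# on 1000 training points, as in the quoted setup.
beta0 = rng.standard_normal(d)
X_train = rng.standard_normal((n_train, d))
y_train = X_train @ beta0 + rng.standard_normal(n_train)
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Held-out audit set and its squared-error losses.
X = rng.standard_normal((n_audit, d))
y = X @ beta0 + rng.standard_normal(n_audit)
loss = (y - X @ beta_hat) ** 2

# Nominal error rate 0.1 and 500 bootstrap samples, per the quoted setup.
# Simplified percentile-bootstrap lower confidence bound on the mean loss
# (a stand-in for the paper's critical-value construction).
alpha, B = 0.1, 500
boot_means = np.array([
    loss[rng.integers(0, n_audit, n_audit)].mean() for _ in range(B)
])
lcb = np.quantile(boot_means, alpha)
```

A real audit in the paper's framework bounds a fairness disparity ϵ(G) simultaneously over a collection of groups G, which requires the bootstrapped critical values of Algorithms 1-5 rather than a single-group percentile bound.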