Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Statistical Inference for Fairness Auditing
Authors: John J. Cherian, Emmanuel J. Candès
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees. |
| Researcher Affiliation | Academia | John J. Cherian EMAIL Department of Statistics Stanford University Stanford, CA 94305, USA Emmanuel J. Candès EMAIL Departments of Mathematics and Statistics Stanford University Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1: Bootstrapping the lower confidence bound critical value; Algorithm 2: Bootstrapping the (rescaled) lower confidence bound critical value; Algorithm 3: Bootstrapping the Boolean certificate critical value; Algorithm 4: Constructing p-values for G ∈ G; Algorithm 5: Bootstrapping the RKHS confidence set critical value |
| Open Source Code | Yes | A Python package, fairaudit, implementing these methods is available to install from PyPI and can be downloaded at github.com/jjcherian/fairaudit. |
| Open Datasets | Yes | We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees. Following Angwin et al. (2016), we apply our auditing method to a data set obtained by ProPublica in 2016 that includes COMPAS risk scores... for n = 6781 individuals. We evaluate the flagging methodology on an income prediction dataset derived from the 2018 Census American Community Survey Public Use Microdata and made available in the Folktables package (Ding et al., 2021; Flood, 2015). |
| Dataset Splits | Yes | Using held-out data sets of varying size, we issue Boolean certificates for sub-intervals G ⊆ [0, 1] over which ϵ(G) < 1. ...we fit logistic and linear regression models to a training set of 1000 data points, and then sample holdout sets of varying size from the remaining data. |
| Hardware Specification | Yes | Running a certification audit at the largest sample size considered (n = 1600) takes under 7 seconds on a 2020 MacBook Pro. Each audit takes under one second on a 2020 MacBook Pro. This flagging audit takes under 0.5 seconds on a 2020 MacBook Pro. Each audit takes approximately 1.5 seconds on a 2020 MacBook Pro. |
| Software Dependencies | No | A Python package, fairaudit, implementing these methods is available to install from PyPI and can be downloaded at github.com/jjcherian/fairaudit. |
| Experiment Setup | Yes | We set the nominal error rate to 0.1 and vary the sample size n over (100, 200, 400, 800, 1600). We evaluate our method using 500 bootstrap samples. We sample β0 ∼ N(0, 1) i.i.d. and use 1000 training points, (Xi, Yi), to fit the OLS predictor, f(x) = β̂⊤x. We then audit over covariate shifts corresponding to the nonnegative functions belonging to the unit ball of a Gaussian RKHS with varying bandwidths σ ∈ {0.1, 0.5, 1}. |
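As a rough illustration of the experiment setup quoted above (sample β0 i.i.d. N(0, 1), fit an OLS predictor on 1000 training points, audit with 500 bootstrap samples at nominal error rate 0.1), the following sketch shows the general shape of such a bootstrap audit. This is not the fairaudit API or the paper's exact algorithm: the percentile lower confidence bound on a group's mean loss is a simplified stand-in for the paper's Algorithms 1-3, and all dimensions and variable names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper varies the audit size over (100, ..., 1600).
d = 5
n_train, n_audit = 1000, 800

# Sample beta0 i.i.d. N(0, 1) and fit the OLS predictor f(x) = beta_hat^T x
# on 1000 training points, as in the quoted setup.
beta0 = rng.standard_normal(d)
X_train = rng.standard_normal((n_train, d))
y_train = X_train @ beta0 + rng.standard_normal(n_train)
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Held-out audit set and its squared-error losses.
X = rng.standard_normal((n_audit, d))
y = X @ beta0 + rng.standard_normal(n_audit)
loss = (y - X @ beta_hat) ** 2

# Nominal error rate 0.1 and 500 bootstrap samples, per the quoted setup.
# Simplified percentile-bootstrap lower confidence bound on the mean loss
# (a stand-in for the paper's critical-value construction).
alpha, B = 0.1, 500
boot_means = np.array([
    loss[rng.integers(0, n_audit, n_audit)].mean() for _ in range(B)
])
lcb = np.quantile(boot_means, alpha)
```

A real audit in the paper's framework bounds a fairness disparity ϵ(G) simultaneously over a collection of groups G, which requires the bootstrapped critical values of Algorithms 1-5 rather than a single-group percentile bound.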