Provable Detection of Propagating Sampling Bias in Prediction Models
Authors: Pavan Ravishankar, Qingyu Mo, Edward McFowland III, Daniel B. Neill
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets (the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy), we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed. |
| Researcher Affiliation | Academia | Pavan Ravishankar¹, Qingyu Mo¹, Edward McFowland III², Daniel B. Neill¹. ¹Machine Learning for Good Laboratory, New York University; ²Harvard Business School |
| Pseudocode | Yes | The Bias Scan algorithm for optimizing F(S) over rectangular subgroups is provided in the Technical Appendix. |
| Open Source Code | No | The paper does not provide a link to its own source code nor does it state that the code is available in supplementary materials or elsewhere. |
| Open Datasets | Yes | The public dataset compiled by ProPublica⁴, including COMPAS risk predictions for 7,214 defendants in Broward County, Florida, from 2013-2014, and a two-year follow-up to record which defendants were rearrested, has been studied by numerous algorithmic bias researchers (Barenstein 2019). The dataset is further referenced in Footnote 4: https://github.com/propublica/compas-analysis/compasscores-two-years.csv. The NYPD data is also described as downloaded from the city's website⁶, with Footnote 6 providing the URL: www1.nyc.gov/site/nypd/stats/reports-analysis/stop-frisk.page. |
| Dataset Splits | No | For each trial, we randomly partition the data into 80% training and 20% testing data. The paper specifies training and testing splits but does not mention a validation split. |
| Hardware Specification | No | The paper does not mention any specific hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using |
| Experiment Setup | No | The paper describes the general setup, such as using random forest and logistic regression classifiers, performing 100 trials, and data partitioning (80% training, 20% testing). However, it does not provide specific hyperparameter values like learning rate, batch size, or optimizer settings for the models used. |
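
The partitioning protocol described in the Experiment Setup row (100 trials, each with a random 80%/20% train/test split) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_indices` and `run_trials`, the per-trial seeding scheme, and the rounding of the 80% cut point are all assumptions. The row count of 7,214 is the COMPAS figure from the Open Datasets row. Model fitting is left out because the paper does not report hyperparameters.

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=None):
    """Randomly partition row indices 0..n_rows-1 into train and test lists.

    Hypothetical helper: the paper only states an 80%/20% random partition.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(round(train_frac * n_rows))  # rounding choice is an assumption
    return indices[:cut], indices[cut:]


def run_trials(n_rows, n_trials=100, train_frac=0.8, base_seed=0):
    """Return one (train, test) index pair per trial.

    Each trial gets its own seed (an assumption) so a run can be repeated.
    """
    return [
        split_indices(n_rows, train_frac, seed=base_seed + t)
        for t in range(n_trials)
    ]


# 7,214 rows matches the COMPAS dataset size quoted above.
splits = run_trials(n_rows=7214)
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # 100 5771 1443
```

In a full replication, each `(train, test)` pair would index into the dataset before fitting the random forest or logistic regression classifier for that trial. Fixing the seed per trial keeps the partitions reproducible even though the paper does not document one.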