Assessing Fairness in the Presence of Missing Data
Authors: Yiliang Zhang, Qi Long
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide upper and lower bounds on the fairness estimation error and conduct numerical experiments to assess our theoretical results. Our work provides the first known theoretical results on fairness guarantee in analysis of incomplete data. |
| Researcher Affiliation | Academia | Yiliang Zhang, University of Pennsylvania, Philadelphia, PA 19104, USA (zylthu14@sas.upenn.edu); Qi Long, University of Pennsylvania, Philadelphia, PA 19104, USA (qlong@upenn.edu) |
| Pseudocode | No | The paper describes algorithms and methods in text and mathematical formulations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We conduct analyses of two real datasets, one from COMPAS and the other from ADNI. The COMPAS dataset analyzed in this work contains records of defendants from Broward County from 2013 and 2014. The dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) contains gene expression and clinical data for 649 patients. |
| Dataset Splits | Yes | In each experiment, we randomly split the real dataset into two subsets. In the first subset, we generate missing values, and the complete cases in this subset are used to train a random forest prediction model g and estimate its fairness in the complete data domain. The true fairness T(g) is approximated using the entire second subset. (A sketch of this protocol appears after the table.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory specifications). |
| Software Dependencies | No | The paper mentions using statistical models and algorithms like 'logistic regression', 'random forest', 'support vector machine', and 'XGBoost', but it does not specify any version numbers for these software components or libraries. |
| Experiment Setup | Yes | In our simulation experiments, we assess the upper bound in Theorem 1 in a classification task and the lower bound in Theorem 2 in a regression task. In each experiment, we generate 10 predictors and a binary sensitive attribute $A \in \{0, 1\}$ with $n$ samples. Unless noted otherwise, the predictors are generated from Gaussian distributions: $x_{ij} \sim N(1 - 2A_i, 0.5^2)$. We use a set of 2000 data [...] to train a prediction algorithm $g$, with linear SVM as the prediction model. We vary the total sample size $n$ from $10^3$ to $10^5$, and for each fixed $n$ we examine different levels of sample imbalance between the two sensitive groups by varying the ratio $n_1/n_0$ from 1 to 20. Missingness is generated under MAR using the model $\operatorname{logit}(\pi(z_i, A_i)) = 2 - \tfrac{1}{5}\sum_{j=1}^{10} x_{ij}$. (A sketch of this setup appears below.) |
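
The simulation setup quoted in the last row lends itself to a short sketch. The following is a minimal reconstruction under stated assumptions, not the authors' code: the label model for `y` is hypothetical, and we apply the MAR mechanism at the record level (the excerpt does not say whether missingness is per record or per variable).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def simulate(n0, n1, p=10):
    # Binary sensitive attribute A in {0, 1}; predictors x_ij ~ N(1 - 2*A_i, 0.5^2),
    # as quoted in the Experiment Setup row.
    A = np.repeat([0, 1], [n0, n1])
    X = rng.normal(loc=1 - 2 * A[:, None], scale=0.5, size=(n0 + n1, p))
    return X, A

def mar_observed(X):
    # Quoted MAR mechanism: logit(pi_i) = 2 - (1/5) * sum_j x_ij.
    # Here pi_i is treated as the probability that record i is fully
    # observed -- a simplifying assumption on our part.
    logit = 2.0 - X.sum(axis=1) / 5.0
    pi = 1.0 / (1.0 + np.exp(-logit))
    return rng.random(X.shape[0]) < pi

# 2000 samples to train the linear-SVM prediction model g, as in the setup.
X, A = simulate(n0=1000, n1=1000)
y = (X.mean(axis=1) + rng.normal(scale=0.5, size=X.shape[0]) > 0).astype(int)  # hypothetical labels
obs = mar_observed(X)
g = LinearSVC().fit(X[obs], y[obs])
```

Varying `n0` and `n1` reproduces the sample-imbalance sweep (ratio $n_1/n_0$ from 1 to 20) described in the row above.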
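
The split-and-evaluate protocol from the Dataset Splits row can be sketched similarly. The fairness metric below (demographic-parity gap) and all helper names are our own illustrative choices; the paper studies fairness estimation generally and does not fix this specific metric in the excerpt.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def dp_gap(model, X, A):
    # Demographic-parity gap |P(g(x)=1 | A=0) - P(g(x)=1 | A=1)|,
    # one of several group-fairness metrics; purely for illustration.
    yhat = model.predict(X)
    return abs(yhat[A == 0].mean() - yhat[A == 1].mean())

def split_and_evaluate(X, y, A, observed_mask_fn, seed=0):
    # Randomly split the dataset into two subsets.
    X1, X2, y1, y2, A1, A2 = train_test_split(
        X, y, A, test_size=0.5, random_state=seed
    )
    # Generate missingness in the first subset; keep complete cases only,
    # and train the random forest prediction model g on them.
    obs = observed_mask_fn(X1)
    g = RandomForestClassifier(random_state=seed).fit(X1[obs], y1[obs])
    # Fairness estimated in the complete-data domain ...
    fairness_complete_cases = dp_gap(g, X1[obs], A1[obs])
    # ... versus the "true" fairness T(g), approximated on the second subset.
    fairness_true = dp_gap(g, X2, A2)
    return fairness_complete_cases, fairness_true
```

With the simulated `X`, `y`, `A`, and `mar_observed` from the previous sketch, `split_and_evaluate(X, y, A, mar_observed)` returns the complete-case fairness estimate alongside the approximation of the true fairness.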