On the Within-Group Fairness of Screening Classifiers

Authors: Nastaran Okati, Stratis Tsirtsis, Manuel Gomez Rodriguez

ICML 2023

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: "In this section, we create multiple instances of a simulated screening process using US Census survey data to first investigate how frequently within-group unfairness occurs in a recruiting domain and then compare the partitions, as well as the induced screening classifiers, provided by Algorithms 1, 2 and 3. We use a dataset consisting of 3.2 million individuals from the US Census (Ding et al., 2021). Each individual is represented by sixteen features and one label y ∈ {0, 1} indicating whether the individual is employed (y = 1) or not (y = 0)."

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: "1 Max Planck Institute for Software Systems. Correspondence to: Nastaran Okati <nastaran@mpi-sws.org>."

Reproducibility Variable: Pseudocode
Result: Yes
LLM Response: "Algorithm 1: It returns a partition B_pav such that f_{B_pav} is within-group monotone. Algorithm 2: It returns the optimal partition B such that f_B is within-group monotone. Algorithm 3: It returns the optimal partition B_cal such that f_{B_cal} is within-group calibrated."

Reproducibility Variable: Open Source Code
Result: Yes
LLM Response: "An implementation of our algorithms and the data used in our experiments are available at https://github.com/Networks-Learning/within-group-monotonicity."

Reproducibility Variable: Open Datasets
Result: Yes
LLM Response: "We use a dataset consisting of 3.2 million individuals from the US Census (Ding et al., 2021). For the experiments, we randomly split the dataset into two equally-sized and disjoint subsets. We use the first subset for training and calibration and the second subset for testing. More specifically, for each experiment, we create the training and calibration sets Dtr and Dcal by picking 100,000 and 50,000 individuals at random (without replacement) from the first subset."

Reproducibility Variable: Dataset Splits
Result: Yes
LLM Response: "More specifically, for each experiment, we create the training and calibration sets Dtr and Dcal by picking 100,000 and 50,000 individuals at random (without replacement) from the first subset. We use Dtr to train a logistic regression model f_LR and use Dcal to both (approximately) calibrate f_LR using uniform mass binning (UMB) (Wang et al., 2022; Zadrozny & Elkan, 2001)."

Reproducibility Variable: Hardware Specification
Result: Yes
LLM Response: "We ran all experiments on a machine equipped with 48 Intel(R) Xeon(R) 2.50GHz CPU cores and 256GB memory."

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using a "logistic regression model" but does not specify the software libraries or their version numbers (e.g., Python, scikit-learn, PyTorch, etc.) used for implementation.

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: "We use Dtr to train a logistic regression model f_LR and use Dcal to both (approximately) calibrate f_LR using uniform mass binning (UMB) (Wang et al., 2022; Zadrozny & Elkan, 2001), i.e., discretize its outputs to n calibrated quality scores, and estimate the relevant probabilities ρ_i, a_i, ρ_{z|i} and a_{i,z} needed by Algorithms 1, 2, and 3. We experiment with several screening classifiers f with a varying number of bins n."
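The splitting scheme quoted above (two equal disjoint halves, then Dtr and Dcal drawn without replacement from the first half) can be sketched as follows. This is a minimal illustration on synthetic indices, not the authors' code; the sizes 300 and 100 are stand-ins for the paper's 100,000 and 50,000, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_total = 1000  # stand-in for the 3.2M Census records
indices = rng.permutation(n_total)

# Split into two equally-sized, disjoint subsets:
# the first for training/calibration, the second for testing.
first_half = indices[: n_total // 2]
test_set = indices[n_total // 2 :]

# Draw Dtr and Dcal at random (without replacement) from the first half.
sampled = rng.choice(first_half, size=300 + 100, replace=False)
d_tr, d_cal = sampled[:300], sampled[300:]
```

Because both sets come from one `choice(..., replace=False)` call, Dtr and Dcal are disjoint from each other and from the test half by construction.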
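The calibration step in the setup quote, uniform mass binning (UMB), places bin edges at score quantiles so that each of the n bins holds roughly the same number of calibration points, and assigns each bin the empirical positive rate inside it as its calibrated score. A minimal sketch on synthetic scores, assuming a hypothetical `umb_calibrate` helper in place of the authors' implementation:

```python
import numpy as np

def umb_calibrate(scores, labels, n_bins):
    """Uniform mass binning: quantile bin edges, per-bin positive rates."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # cover the full [0, 1] score range
    bin_ids = np.clip(np.searchsorted(edges, scores, side="right") - 1,
                      0, n_bins - 1)
    # Calibrated quality score of each bin = fraction of positives in it.
    bin_means = np.array([
        labels[bin_ids == b].mean() if np.any(bin_ids == b) else 0.0
        for b in range(n_bins)
    ])
    return bin_ids, bin_means

rng = np.random.default_rng(1)
scores = rng.uniform(size=2000)                           # raw classifier scores
labels = (rng.uniform(size=2000) < scores).astype(int)    # synthetic labels
bin_ids, bin_means = umb_calibrate(scores, labels, n_bins=10)
```

With quantile edges, the 2,000 points land in bins of roughly 200 each; the per-bin means play the role of the n discretized, calibrated quality scores mentioned in the quote.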