On the Within-Group Fairness of Screening Classifiers

Authors: Nastaran Okati, Stratis Tsirtsis, Manuel Gomez Rodriguez

ICML 2023

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: "In this section, we create multiple instances of a simulated screening process using US Census survey data to first investigate how frequently within-group unfairness occurs in a recruiting domain and then compare the partitions, as well as the induced screening classifiers, provided by Algorithms 1, 2 and 3. We use a dataset consisting of 3.2 million individuals from the US Census (Ding et al., 2021). Each individual is represented by sixteen features and one label y ∈ {0, 1} indicating whether the individual is employed (y = 1) or not (y = 0)."

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: "1 Max Planck Institute for Software Systems. Correspondence to: Nastaran Okati <nastaran@mpi-sws.org>."

Reproducibility Variable: Pseudocode
Result: Yes
LLM Response: "Algorithm 1: It returns a partition B_pav such that f_{B_pav} is within-group monotone. Algorithm 2: It returns the optimal partition B such that f_B is within-group monotone. Algorithm 3: It returns the optimal partition B_cal such that f_{B_cal} is within-group calibrated."

Reproducibility Variable: Open Source Code
Result: Yes
LLM Response: "An implementation of our algorithms and the data used in our experiments are available at https://github.com/Networks-Learning/within-group-monotonicity."

Reproducibility Variable: Open Datasets
Result: Yes
LLM Response: "We use a dataset consisting of 3.2 million individuals from the US Census (Ding et al., 2021). For the experiments, we randomly split the dataset into two equally-sized and disjoint subsets. We use the first subset for training and calibration and the second subset for testing. More specifically, for each experiment, we create the training and calibration sets Dtr and Dcal by picking 100,000 and 50,000 individuals at random (without replacement) from the first subset."

Reproducibility Variable: Dataset Splits
Result: Yes
LLM Response: "More specifically, for each experiment, we create the training and calibration sets Dtr and Dcal by picking 100,000 and 50,000 individuals at random (without replacement) from the first subset. We use Dtr to train a logistic regression model f_LR and use Dcal to both (approximately) calibrate f_LR using uniform mass binning (UMB) (Wang et al., 2022; Zadrozny & Elkan, 2001)."

Reproducibility Variable: Hardware Specification
Result: Yes
LLM Response: "We ran all experiments on a machine equipped with 48 Intel(R) Xeon(R) 2.50GHz CPU cores and 256GB memory."

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using a "logistic regression model" but does not specify the software libraries or their version numbers (e.g., Python, scikit-learn, PyTorch, etc.) used for implementation.

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: "We use Dtr to train a logistic regression model f_LR and use Dcal to both (approximately) calibrate f_LR using uniform mass binning (UMB) (Wang et al., 2022; Zadrozny & Elkan, 2001), i.e., discretize its outputs to n calibrated quality scores, and estimate the relevant probabilities ρ_i, a_i, ρ_{z|i} and a_{i,z} needed by Algorithms 1, 2, and 3. We experiment with several screening classifiers f with a varying number of bins n."
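The splitting scheme quoted above (two equal disjoint halves, then Dtr and Dcal drawn without replacement from the first half) can be sketched as follows. This is a minimal illustration on synthetic indices, not the authors' code; the sizes 300 and 100 are stand-ins for the paper's 100,000 and 50,000, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_total = 1000  # stand-in for the 3.2M Census records
indices = rng.permutation(n_total)

# Split into two equally-sized, disjoint subsets:
# the first for training/calibration, the second for testing.
first_half = indices[: n_total // 2]
test_set = indices[n_total // 2 :]

# Draw Dtr and Dcal at random (without replacement) from the first half.
sampled = rng.choice(first_half, size=300 + 100, replace=False)
d_tr, d_cal = sampled[:300], sampled[300:]
```

Because both sets come from one `choice(..., replace=False)` call, Dtr and Dcal are disjoint from each other and from the test half by construction.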
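The calibration step in the setup quote, uniform mass binning (UMB), places bin edges at score quantiles so that each of the n bins holds roughly the same number of calibration points, and assigns each bin the empirical positive rate inside it as its calibrated score. A minimal sketch on synthetic scores, assuming a hypothetical `umb_calibrate` helper in place of the authors' implementation:

```python
import numpy as np

def umb_calibrate(scores, labels, n_bins):
    """Uniform mass binning: quantile bin edges, per-bin positive rates."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # cover the full [0, 1] score range
    bin_ids = np.clip(np.searchsorted(edges, scores, side="right") - 1,
                      0, n_bins - 1)
    # Calibrated quality score of each bin = fraction of positives in it.
    bin_means = np.array([
        labels[bin_ids == b].mean() if np.any(bin_ids == b) else 0.0
        for b in range(n_bins)
    ])
    return bin_ids, bin_means

rng = np.random.default_rng(1)
scores = rng.uniform(size=2000)                           # raw classifier scores
labels = (rng.uniform(size=2000) < scores).astype(int)    # synthetic labels
bin_ids, bin_means = umb_calibrate(scores, labels, n_bins=10)
```

With quantile edges, the 2,000 points land in bins of roughly 200 each; the per-bin means play the role of the n discretized, calibrated quality scores mentioned in the quote.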