Data preprocessing to mitigate bias: A maximum entropy based approach

Authors: L. Elisa Celis, Vijay Keswani, Nisheeth Vishnoi

ICML 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, we evaluate the fairness and accuracy of the distributions generated by applying our framework to the Adult and COMPAS datasets, with gender as the protected attribute." (Section 5, Empirical analysis) |
| Researcher Affiliation | Academia | Department of Statistics and Data Science, Yale University, USA; Department of Computer Science, Yale University, USA |
| Pseudocode | Yes | "Algorithm 1: Re-weighting algorithm to assign weights to samples for the prior distribution" |
| Open Source Code | Yes | "The code for our framework is available at https://github.com/vijaykeswani/Fair-Max-Entropy-Distributions" |
| Open Datasets | Yes | (a) the COMPAS dataset (Angwin et al., 2016; Larson et al., 2016); (b) the Adult dataset (Dheeru & Karra Taniskidou, 2017) |
| Dataset Splits | Yes | "We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions." |
| Hardware Specification | Yes | "The machine specifications are a 1.8 GHz Intel Core i5 processor with 8 GB memory." |
| Software Dependencies | No | The paper mentions using a "decision tree classifier with Gini information criterion" and a "Gaussian naive Bayes classifier" but does not provide version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | "We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions. Then we sample 10,000 elements from each distribution and train the classifier on this simulated dataset. This sampling process is repeated 100 times for each distribution. We repeat this process 5 times for each dataset, once for each fold." |
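The "Pseudocode" row refers to the paper's Algorithm 1, which assigns weights to samples when constructing the prior distribution. The paper's exact procedure is not reproduced here; as an illustration only, the sketch below implements a standard Kamiran & Calders-style reweighing, where each (protected group, label) cell is weighted so that the weighted data looks statistically independent of the protected attribute. The function name `reweigh` and its arguments are hypothetical.

```python
from collections import Counter

def reweigh(protected, labels):
    """Weight each sample so that, under the weighted empirical
    distribution, the protected attribute and the label are
    statistically independent (Kamiran & Calders reweighing).

    Illustrative stand-in only; NOT the paper's Algorithm 1.
    """
    n = len(labels)
    group_counts = Counter(protected)              # count per group
    label_counts = Counter(labels)                 # count per label
    cell_counts = Counter(zip(protected, labels))  # count per (group, label)

    weights = []
    for g, y in zip(protected, labels):
        # expected cell count under independence / observed cell count
        expected = group_counts[g] * label_counts[y] / n
        weights.append(expected / cell_counts[(g, y)])
    return weights

# Toy usage: binary protected attribute, binary labels.
protected = [0, 0, 0, 1, 1, 1]
labels    = [1, 1, 0, 1, 0, 0]
print(reweigh(protected, labels))  # over-represented cells get weight < 1
```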
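The "Dataset Splits" and "Experiment Setup" rows together describe the evaluation loop: 5-fold cross-validation, distributions constructed on the four training folds, 10,000 samples drawn per distribution, a classifier trained on the sampled data, and the sampling repeated 100 times per distribution. A minimal sketch of that loop follows, using the two classifier types the paper mentions; `fit_distribution` is a hypothetical stand-in for the paper's max-entropy distribution construction and is not part of the paper's released code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def evaluate(X, y, fit_distribution, n_samples=10_000, n_repeats=100):
    """5-fold CV loop mirroring the protocol described in the paper.

    `fit_distribution(X_train, y_train)` is a hypothetical callable that
    must return an object with a `.sample(n)` method yielding a
    simulated dataset (X_sim, y_sim).
    """
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(X):
        # Construct the distribution on the four training folds.
        dist = fit_distribution(X[train_idx], y[train_idx])
        for _ in range(n_repeats):
            # Draw a simulated training set and train on it.
            X_sim, y_sim = dist.sample(n_samples)
            for clf in (DecisionTreeClassifier(criterion="gini"),
                        GaussianNB()):
                clf.fit(X_sim, y_sim)
                scores.append(
                    accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return float(np.mean(scores))
```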