Data preprocessing to mitigate bias: A maximum entropy based approach
Authors: L. Elisa Celis, Vijay Keswani, Nisheeth Vishnoi
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate the fairness and accuracy of the distributions generated by applying our framework to the Adult and COMPAS datasets, with gender as the protected attribute. (Section 5, Empirical analysis) |
| Researcher Affiliation | Academia | Department of Statistics and Data Science, Yale University, USA; Department of Computer Science, Yale University, USA. |
| Pseudocode | Yes | Algorithm 1: Re-weighting algorithm to assign weights to samples for the prior distribution (an illustrative re-weighting sketch follows the table) |
| Open Source Code | Yes | The code for our framework is available at https://github.com/vijaykeswani/Fair-Max-Entropy-Distributions. |
| Open Datasets | Yes | (a) The COMPAS dataset (Angwin et al., 2016; Larson et al., 2016) (b) The Adult dataset (Dheeru & Karra Taniskidou, 2017) |
| Dataset Splits | Yes | We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions. |
| Hardware Specification | Yes | The machine specifications are a 1.8 GHz Intel Core i5 processor with 8 GB memory. |
| Software Dependencies | No | The paper mentions using a 'decision tree classifier with gini information criterion' and 'Gaussian naive Bayes classifier' but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions. Then we sample 10,000 elements from each distribution and train the classifier on this simulated dataset. This sampling process is repeated 100 times for each distribution. We repeat this process 5 times for each dataset, once for each fold. (See the experiment-loop sketch after the table.) |
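The Pseudocode row quotes Algorithm 1, the authors' re-weighting step for constructing the prior distribution. As an illustration of what such sample re-weighting can look like, below is a minimal sketch in the style of Kamiran & Calders reweighing, where each (group, label) cell receives weight P(group)·P(label)/P(group, label) so that the protected attribute and the label are independent under the weights. This is a stand-in for exposition only, not the paper's exact Algorithm 1; the function name `reweigh` and the column names are hypothetical.

```python
import pandas as pd

def reweigh(df: pd.DataFrame, protected: str = "gender", label: str = "income") -> pd.Series:
    """Illustrative sample re-weighting (Kamiran & Calders style).

    A sample in group g with label y gets weight P(g) * P(y) / P(g, y),
    which makes the protected attribute and the label statistically
    independent under the weighted empirical distribution. The paper's
    Algorithm 1 constructs weights for its max-entropy prior; this
    sketch only illustrates the general re-weighting idea.
    """
    n = len(df)
    p_g = df[protected].value_counts() / n             # group marginals
    p_y = df[label].value_counts() / n                 # label marginals
    p_gy = df.groupby([protected, label]).size() / n   # joint frequencies

    return df.apply(
        lambda row: p_g[row[protected]] * p_y[row[label]] / p_gy[(row[protected], row[label])],
        axis=1,
    )

# Hypothetical usage on an Adult-like frame:
# df["weight"] = reweigh(df, protected="gender", label="income")
```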
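The Experiment Setup row describes the evaluation protocol: 5-fold cross-validation, 10,000 points sampled from each constructed distribution, a classifier trained on each draw, and 100 repetitions per distribution. Below is a minimal sketch of that loop, assuming a hypothetical `fit_distribution` callable that returns an object with a `.sample(n)` method; it stands in for the paper's max-entropy distribution, whose construction is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def run_protocol(X, y, fit_distribution, n_samples=10_000, n_repeats=100, seed=0):
    """Sketch of the paper's evaluation loop (hypothetical helper names).

    For each of 5 folds: fit a distribution on the 4 training folds,
    draw `n_samples` points from it, train a decision tree (Gini
    criterion, as in the paper) on the draw, and score it on the
    held-out fold. Each distribution is sampled `n_repeats` times.
    """
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        dist = fit_distribution(X[train_idx], y[train_idx])   # e.g., the max-entropy distribution
        for _ in range(n_repeats):                            # 100 simulated training sets per fold
            X_sim, y_sim = dist.sample(n_samples)             # 10,000-point simulated dataset
            clf = DecisionTreeClassifier(criterion="gini").fit(X_sim, y_sim)
            scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```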