Data preprocessing to mitigate bias: A maximum entropy based approach
Authors: L. Elisa Celis, Vijay Keswani, Nisheeth Vishnoi
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate the fairness and accuracy of the distributions generated by applying our framework to the Adult and COMPAS datasets, with gender as the protected attribute. (Section 5, Empirical analysis) |
| Researcher Affiliation | Academia | Department of Statistics and Data Science, Yale University, USA; Department of Computer Science, Yale University, USA. |
| Pseudocode | Yes | Algorithm 1: Re-weighting algorithm to assign weights to samples for the prior distribution (an illustrative re-weighting sketch follows the table) |
| Open Source Code | Yes | The code for our framework is available at https://github.com/vijaykeswani/Fair-Max-Entropy-Distributions. |
| Open Datasets | Yes | (a) The COMPAS dataset (Angwin et al., 2016; Larson et al., 2016) (b) The Adult dataset (Dheeru & Karra Taniskidou, 2017) |
| Dataset Splits | Yes | We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions. |
| Hardware Specification | Yes | The machine specifications are a 1.8 GHz Intel Core i5 processor with 8 GB memory. |
| Software Dependencies | No | The paper mentions using a 'decision tree classifier with gini information criterion' and 'Gaussian naive Bayes classifier' but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | We perform 5-fold cross-validation for every dataset, i.e., we divide each dataset into five partitions. First, we select and combine four partitions into a training dataset and use this dataset to construct the distributions. Then we sample 10,000 elements from each distribution and train the classifier on this simulated dataset. This sampling process is repeated 100 times for each distribution. We repeat this process 5 times for each dataset, once for each fold. (See the experiment-loop sketch after the table.) |
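The Pseudocode row quotes Algorithm 1, the authors' re-weighting step for constructing the prior distribution. As an illustration of what such sample re-weighting can look like, below is a minimal sketch in the style of Kamiran & Calders reweighing, where each (group, label) cell receives weight P(group)·P(label)/P(group, label) so that the protected attribute and the label are independent under the weights. This is a stand-in for exposition only, not the paper's exact Algorithm 1; the function name `reweigh` and the column names are hypothetical.

```python
import pandas as pd

def reweigh(df: pd.DataFrame, protected: str = "gender", label: str = "income") -> pd.Series:
    """Illustrative sample re-weighting (Kamiran & Calders style).

    A sample in group g with label y gets weight P(g) * P(y) / P(g, y),
    which makes the protected attribute and the label statistically
    independent under the weighted empirical distribution. The paper's
    Algorithm 1 constructs weights for its max-entropy prior; this
    sketch only illustrates the general re-weighting idea.
    """
    n = len(df)
    p_g = df[protected].value_counts() / n             # group marginals
    p_y = df[label].value_counts() / n                 # label marginals
    p_gy = df.groupby([protected, label]).size() / n   # joint frequencies

    return df.apply(
        lambda row: p_g[row[protected]] * p_y[row[label]] / p_gy[(row[protected], row[label])],
        axis=1,
    )

# Hypothetical usage on an Adult-like frame:
# df["weight"] = reweigh(df, protected="gender", label="income")
```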
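The Experiment Setup row describes the evaluation protocol: 5-fold cross-validation, 10,000 points sampled from each constructed distribution, a classifier trained on each draw, and 100 repetitions per distribution. Below is a minimal sketch of that loop, assuming a hypothetical `fit_distribution` callable that returns an object with a `.sample(n)` method; it stands in for the paper's max-entropy distribution, whose construction is not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def run_protocol(X, y, fit_distribution, n_samples=10_000, n_repeats=100, seed=0):
    """Sketch of the paper's evaluation loop (hypothetical helper names).

    For each of 5 folds: fit a distribution on the 4 training folds,
    draw `n_samples` points from it, train a decision tree (Gini
    criterion, as in the paper) on the draw, and score it on the
    held-out fold. Each distribution is sampled `n_repeats` times.
    """
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        dist = fit_distribution(X[train_idx], y[train_idx])   # e.g., the max-entropy distribution
        for _ in range(n_repeats):                            # 100 simulated training sets per fold
            X_sim, y_sim = dist.sample(n_samples)             # 10,000-point simulated dataset
            clf = DecisionTreeClassifier(criterion="gini").fit(X_sim, y_sim)
            scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```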