Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Impact of Imputation Strategies on Fairness in Machine Learning
Authors: Simon Caton, Saiteja Malisetty, Christian Haas
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this article, we investigate the impact of different imputation strategies on classical performance and fairness in classification settings. We find that the selected imputation strategy, along with other factors including the type of classification algorithm, can significantly affect performance and fairness outcomes. The results of our experiments indicate that the choice of imputation strategy is an important factor when considering fairness in Machine Learning. We run structured experiments to iterate over a range of parameters, including different classification algorithms, imputation strategies, and percentage of missing values. Each scenario is independently repeated 100 times (totalling 208800 observations) to calculate distributions for the performance and fairness metrics corresponding to the classification outcomes. We analyze the resulting impact using statistical techniques and identify scenarios where imputation strategies and choice of Machine Learning model affect the fairness of the predictions. |
| Researcher Affiliation | Academia | Simon Caton, School of Computer Science, University College Dublin, Ireland; Saiteja Malisetty, University of Nebraska at Omaha; Christian Haas, Department of Strategy and Innovation, Vienna University of Economics and Business (WU), Austria |
| Pseudocode | Yes | Algorithm 1: Pseudocode of Evaluation Setup. Data: Datasets ds ∈ {German, Adult, COMPAS}; Imputation Strategies: 8 strategies for numerical imputation, 2 for categorical imputation; Classification Algorithms CA ∈ {Logistic Regression, Linear SVC, Random Forest}; Repetition i ∈ [1:100]; Number of columns j ∈ [1:ncol(dataset)]; Percentage of deleted values p ∈ {0.01, 0.05, 0.1}. Result: Solution csv with performance and fairness metrics. begin Select dataset; for repetition i do: for number of columns to consider j do: pick j columns from dataset at random; for percentage p do: for column in columns to consider do: randomly delete p percent of values in column; for imputation strategy imp do: impute missing values based on strategy imp; for algorithm algo do: train classifier using algo; calculate performance and fairness metrics; add results to solution csv. Repeat for other datasets; return solution csv. |
| Open Source Code | Yes | The code and datasets used to run our experiment can be found in the following GitHub repository: https://github.com/haas-christian/JAIR-Imputation |
| Open Datasets | Yes | We consider the impact of missing values, and their corresponding imputation, on the classification outcome using three commonly used datasets in the fairness literature: Adult Income, COMPAS, and German Credit. As these three datasets are well structured and, in standard implementations, do not contain missing values, we randomly delete different percentages of values from the three datasets (i.e., data is missing completely at random) and impute the data using nine different imputation techniques to compare how these imputation techniques affect various fairness and performance metrics. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes how missing values were introduced (1, 5, and 10% deleted values per column) and that each scenario was repeated 100 times. It also mentions calculating baseline observations for the original datasets and training classifiers, implying splits were used, but the specific splitting methodology (e.g., percentages, cross-validation, or predefined splits) is not detailed for the original datasets or the experimental runs. |
| Hardware Specification | Yes | All experiments and analyses were run on a 12 core workstation with 128 GB RAM to parallelise the independent repetitions. |
| Software Dependencies | No | The paper mentions using "Python as programming language" and utilizing "the open source package AIF360 (Bellamy et al., 2019)" and "standard scikit-learn implementations." However, it does not provide specific version numbers for Python, AIF360, or scikit-learn, which are required for a reproducible description of ancillary software. |
| Experiment Setup | No | The paper describes the experimental factors such as datasets (German, Adult, COMPAS), imputation strategies (Mean, Median, Most Frequent, k-NN, Iterative, Interpolate, Least Squares, Stochastic, Norm), classification algorithms (Logistic Regression, Random Forest, Linear Support Vector Classifier), and percentage of deleted values (1, 5, 10%). It also states that each scenario was repeated 100 times. However, it does not provide specific hyperparameter values (e.g., learning rate, regularization strength, number of trees, max_depth, C values) for the classification algorithms used, only stating they require "relatively little parameter tuning" and use "standard scikit-learn implementations," which is insufficient for detailed reproducibility. |
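The inner loop of Algorithm 1 hinges on two operations: deleting p percent of a column's values completely at random (MCAR) and then imputing the gaps. The following is a minimal pure-Python sketch of those two steps only; the column data, the mean/median strategies, and the function names are illustrative assumptions, not the paper's actual implementation (which uses scikit-learn and nine strategies):

```python
import random
import statistics

def delete_mcar(values, p, rng):
    # Set roughly p percent of entries to None, missing completely at random.
    out = list(values)
    n_delete = max(1, round(p * len(out)))
    for idx in rng.sample(range(len(out)), n_delete):
        out[idx] = None
    return out

def impute(values, strategy):
    # Fill None entries from the observed values using a simple strategy.
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

rng = random.Random(42)
column = [float(x) for x in range(1, 101)]   # stand-in for one dataset column
for p in (0.01, 0.05, 0.10):                 # deletion percentages from the paper
    missing = delete_mcar(column, p, rng)
    for strategy in ("mean", "median"):      # two of the nine strategies, for illustration
        completed = impute(missing, strategy)
        assert None not in completed and len(completed) == len(column)
```

In the full setup these two steps sit inside the repetition, column-subset, imputation-strategy, and classifier loops, and each completed dataset is then passed to a classifier before performance and fairness metrics are recorded.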