Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Impact of Imputation Strategies on Fairness in Machine Learning
Authors: Simon Caton, Saiteja Malisetty, Christian Haas
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this article, we investigate the impact of different imputation strategies on classical performance and fairness in classification settings. We find that the selected imputation strategy, along with other factors including the type of classification algorithm, can significantly affect performance and fairness outcomes. The results of our experiments indicate that the choice of imputation strategy is an important factor when considering fairness in Machine Learning. We run structured experiments to iterate over a range of parameters, including different classification algorithms, imputation strategies, and percentage of missing values. Each scenario is independently repeated 100 times (totalling 208800 observations) to calculate distributions for the performance and fairness metrics corresponding to the classification outcomes. We analyze the resulting impact using statistical techniques and identify scenarios where imputation strategies and choice of Machine Learning model affect the fairness of the predictions. |
| Researcher Affiliation | Academia | Simon Caton, School of Computer Science, University College Dublin, Ireland; Saiteja Malisetty, University of Nebraska at Omaha; Christian Haas, Department of Strategy and Innovation, Vienna University of Economics and Business (WU), Austria |
| Pseudocode | Yes | Algorithm 1: Pseudocode of Evaluation Setup. Data: Datasets ds ∈ {German, Adult, COMPAS}; Imputation Strategies: 8 strategies for numerical imputation, 2 for categorical imputation; Classification Algorithms CA ∈ {Logistic Regression, Linear SVC, Random Forest}; Repetition i ∈ [1:100]; Number of columns j ∈ [1:ncol(dataset)]; Percentage of deleted values p ∈ {0.01, 0.05, 0.1}. Result: Solution csv with performance and fairness metrics. begin Select dataset; for repetition i do: for number of columns to consider j do: pick j columns from dataset at random; for percentage p do: for column in columns to consider do: randomly delete p percent of values in column; for imputation strategy imp do: impute missing values based on strategy imp; for algorithm algo do: train classifier using algo; calculate performance and fairness metrics; add results to solution csv. Repeat for other datasets; return solution csv. |
| Open Source Code | Yes | The code and datasets used to run our experiment can be found in the following GitHub repository: https://github.com/haas-christian/JAIR-Imputation |
| Open Datasets | Yes | We consider the impact of missing values, and their corresponding imputation, on the classification outcome using three commonly used datasets in the fairness literature: Adult Income, COMPAS, and German Credit. As these three datasets are well structured and, in standard implementations, do not contain missing values, we randomly delete different percentages of values from the three datasets (i.e., data is missing completely at random) and impute the data using nine different imputation techniques to compare how these imputation techniques affect various fairness and performance metrics. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes how missing values were introduced (1, 5, and 10% deleted values per column) and that each scenario was repeated 100 times. It also mentions calculating baseline observations for the original datasets and training classifiers, implying splits were used, but the specific splitting methodology (e.g., percentages, cross-validation, or predefined splits) is not detailed for the original datasets or the experimental runs. |
| Hardware Specification | Yes | All experiments and analyses were run on a 12 core workstation with 128 GB RAM to parallelise the independent repetitions. |
| Software Dependencies | No | The paper mentions using "Python as programming language" and utilizing "the open source package AIF360 (Bellamy et al., 2019)" and "standard scikit-learn implementations." However, it does not provide specific version numbers for Python, AIF360, or scikit-learn, which are required for a reproducible description of ancillary software. |
| Experiment Setup | No | The paper describes the experimental factors such as datasets (German, Adult, COMPAS), imputation strategies (Mean, Median, Most Frequent, k-NN, Iterative, Interpolate, Least Squares, Stochastic, Norm), classification algorithms (Logistic Regression, Random Forest, Linear Support Vector Classifier), and percentage of deleted values (1, 5, 10%). It also states that each scenario was repeated 100 times. However, it does not provide specific hyperparameter values (e.g., learning rate, regularization strength, number of trees, max_depth, C values) for the classification algorithms used, only stating they require "relatively little parameter tuning" and use "standard scikit-learn implementations," which is insufficient for detailed reproducibility. |
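The inner loop of Algorithm 1 hinges on two operations: deleting p percent of a column's values completely at random (MCAR) and then imputing the gaps. The following is a minimal pure-Python sketch of those two steps only; the column data, the mean/median strategies, and the function names are illustrative assumptions, not the paper's actual implementation (which uses scikit-learn and nine strategies):

```python
import random
import statistics

def delete_mcar(values, p, rng):
    # Set roughly p percent of entries to None, missing completely at random.
    out = list(values)
    n_delete = max(1, round(p * len(out)))
    for idx in rng.sample(range(len(out)), n_delete):
        out[idx] = None
    return out

def impute(values, strategy):
    # Fill None entries from the observed values using a simple strategy.
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

rng = random.Random(42)
column = [float(x) for x in range(1, 101)]   # stand-in for one dataset column
for p in (0.01, 0.05, 0.10):                 # deletion percentages from the paper
    missing = delete_mcar(column, p, rng)
    for strategy in ("mean", "median"):      # two of the nine strategies, for illustration
        completed = impute(missing, strategy)
        assert None not in completed and len(completed) == len(column)
```

In the full setup these two steps sit inside the repetition, column-subset, imputation-strategy, and classifier loops, and each completed dataset is then passed to a classifier before performance and fairness metrics are recorded.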