Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Size-adaptive Hypothesis Testing for Fairness

Authors: Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
Researcher Affiliation	Collaboration	Antonio Ferrara (CENTAI, Turin, Italy; Graz University of Technology, Austria); Francesco Cozzi (Sapienza University, Rome, Italy; CENTAI, Turin, Italy); Alan Perotti (CENTAI, Turin, Italy); André Panisson (CENTAI, Turin, Italy); Francesco Bonchi (CENTAI, Turin, Italy; EURECAT, Barcelona, Spain)
Pseudocode	Yes	Algorithm 1 summarizes our rigorous statistical framework for fairness assessment, which is specifically designed to handle varying subgroup sizes in intersectional settings. It integrates large-sample hypothesis testing with Bayesian inference, ensuring that fairness evaluations remain reliable even when data availability differs across subgroups. It dynamically adjusts significance thresholds, thus accounting for statistical uncertainty and preventing misleading conclusions about bias.
Open Source Code	Yes	Our full codebase including data preprocessing, model training, and auditing notebooks can be found at https://github.com/alanturin-g/SAFT.
Open Datasets	Yes	We selected two standard datasets used in fairness benchmarks: the Adult Income dataset [5] and the COMPAS recidivism dataset [2]. (Footnote 1: COMPAS from ProPublica (compas-scores-two-years.csv); Adult from fairlearn.org, originally from UCI.) The additional experiments include the German Credit [32] and the Student Performance [13] datasets.
Dataset Splits	Yes	We consider both COMPAS and Adult datasets over 20 random 2:1 train-test splits, followed by model training and fairness auditing each subgroup on every split. We perform 20 independent 2:1 stratified train-test splits to account for sampling variability.
Hardware Specification	Yes	All experiments were run on a server equipped with an Intel Xeon Gold 6312U CPU and 256 GB of RAM.
Software Dependencies	No	The paper mentions using an XGBoost classifier and lists its parameters, but does not specify a version number for XGBoost itself or any other software dependencies like Python or specific libraries with their versions.
Experiment Setup	Yes	In all experiments, we train an XGBoost classifier [10] on each dataset without hyperparameter tuning, since our focus is on fairness auditing rather than maximizing predictive performance. The following default XGBoost parameters were used for all runs: n_estimators = 100, max_depth = 6, learning_rate = 0.3, subsample = 1.0, colsample_bytree = 1.0, objective = "binary:logistic", eval_metric = "logloss".
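
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal Python sketch. It is not taken from the authors' repository: the helper names (`stratified_split`, `repeated_splits`), the seeding scheme, and the toy labels are assumptions made for illustration, and only the parameter values and the 20 stratified 2:1 splits come from the rows above. The sketch uses only the standard library; the parameter dictionary is written so that it could plausibly be passed to `xgboost.XGBClassifier(**XGB_PARAMS)`, although that step is not shown.

```python
import random
from collections import defaultdict

# Parameter values as listed in the Experiment Setup row.
XGB_PARAMS = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.3,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
}

N_SPLITS = 20          # from the Dataset Splits row
TEST_FRACTION = 1 / 3  # a 2:1 train-test ratio


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def repeated_splits(labels, n_splits=N_SPLITS,
                    test_fraction=TEST_FRACTION, base_seed=0):
    """Yield one stratified split per seed (the seeding scheme is an assumption)."""
    for k in range(n_splits):
        yield stratified_split(labels, test_fraction, seed=base_seed + k)


if __name__ == "__main__":
    toy_labels = [0] * 60 + [1] * 30  # hypothetical labels, not real data
    splits = list(repeated_splits(toy_labels))
    train, test = splits[0]
    print(len(splits), len(train), len(test))
```

Each split would then be used to train one classifier and audit every subgroup, so that variation across the 20 runs reflects sampling variability rather than a single arbitrary partition.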