Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Active Fourier Auditor for Estimating Distributional Properties of ML Models

Authors: Ayoub Ajarra, Bishwamittra Ghosh, Debabrota Basu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the performance of AFA in estimating multiple models' group fairness, robustness, and individual fairness. Below, we provide a detailed discussion of the experimental setup, objectives, and results. Experimental Setup: We conduct experiments on the COMPAS (Angwin et al. 2016), student performance (Student) (Cortez and Silva 2008), and drug consumption (Drug) (Fehrman et al. 2019) datasets. The datasets contain a mix of binary, categorical, and continuous features for binary and multi-class classification. We evaluate AFA on three ML models: Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Random Forest (RF). The ground truth of group fairness, individual fairness, and robustness is computed using the entire dataset as in (Yan and Zhang 2022). For group fairness, we compare AFA with a uniform sampling method, namely Uniform, and the active fairness auditing algorithms (Yan and Zhang 2022, Algorithm 3), i.e. CAL and its variants µCAL and randomized µCAL, which require more information about the model class than black-box access. We report the best variant of CAL with the lowest error. For robustness and individual fairness, we compare AFA with Uniform. Each experiment is run 10 times, and we report the averages. We refer to Appendix F.1 for details.
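The Uniform baseline referenced in the excerpt above can be illustrated with a short sketch: uniformly sample query points, call the black-box model, and estimate a group-fairness statistic from the responses. The function name `uniform_audit_dp`, its signature, and the choice of the demographic-parity gap as the metric are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def uniform_audit_dp(model_predict, X, group, n_queries=1000, seed=0):
    """Estimate the demographic-parity gap of a black-box classifier
    by querying it on uniformly sampled points (a 'Uniform' baseline)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_queries)  # uniform sample with replacement
    preds = model_predict(X[idx])                  # black-box access only
    g = group[idx]
    # Gap = P(y_hat = 1 | group = 1) - P(y_hat = 1 | group = 0)
    return preds[g == 1].mean() - preds[g == 0].mean()

# Toy check: a model whose prediction equals the group attribute
# has the maximal gap of 1.0.
X = np.arange(100).reshape(-1, 1)
group = (np.arange(100) >= 50).astype(int)
gap = uniform_audit_dp(lambda x: (x[:, 0] >= 50).astype(int), X, group)
```

With more queries the Monte Carlo estimate concentrates around the true gap; AFA itself replaces this uniform sampling with an active, Fourier-based query strategy.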
Researcher Affiliation | Academia | Ayoub Ajarra¹, Bishwamittra Ghosh², Debabrota Basu¹; ¹Équipe Scool, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, Lille, France; ²Max Planck Institute for Software Systems, Germany
Pseudocode | Yes | Algorithm 1: Active Fourier Auditor (AFA)
Open Source Code | Yes | https://github.com/ayoubajarra/afamp
Open Datasets | Yes | We conduct experiments on the COMPAS (Angwin et al. 2016), student performance (Student) (Cortez and Silva 2008), and drug consumption (Drug) (Fehrman et al. 2019) datasets.
Dataset Splits | No | The paper mentions using datasets but does not specify how they are split into training, validation, or test sets for the experiments conducted. It states, "The ground truth of group fairness, individual fairness, and robustness is computed using the entire dataset as in (Yan and Zhang 2022)," but this refers to the ground-truth calculation, not data splits for model evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments. It only describes the experimental setup in terms of datasets and models.
Software Dependencies | No | The paper does not explicitly list any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, specific libraries, or solvers).
Experiment Setup | No | The paper describes the comparison methods and states that experiments were run 10 times with averages reported, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or system-level training settings for the ML models (LR, MLP, RF), nor the AFA algorithm's parameters (e.g., τ, δ).