Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Retiring $\Delta \text{DP}$: New Distribution-Level Metrics for Demographic Parity

Authors: Xiaotian Han, Zhimeng Jiang, Hongye Jin, Zirui Liu, Na Zou, Qifan Wang, Xia Hu

TMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically evaluate the estimation tractability of our proposed fairness metrics and visualize the bias density. First, we show the relative estimation error of our proposed metrics is lower than that of mutual information (MI) in the experiments with synthetic data. Subsequently, we visualize the bias density for vanilla MLP and adversarial debiasing method in ACS-Income dataset. We conduct experiments on various datasets to re-evaluate the commonly-used fair models.
Researcher Affiliation	Collaboration	¹Texas A&M University, ²Rice University, ³Meta AI
Pseudocode	Yes	Algorithm 1 Python code of ABPC... Algorithm 2 Python code of ABCC
Open Source Code	Yes	The code is available at https://github.com/ahxt/new_metric_for_demographic_parity.
Open Datasets	Yes	UCI Adult (Dua & Graff, 2017) contains clean information about 45,222 individuals from the 1994 US Census. ... ACS-Income (Ding et al., 2021) derives from the American Community Survey (ACS) Public Use Microdata Sample (PUMS). ... ACS-Employment (Ding et al., 2021) also derives from the ACS PUMS. ... KDD Census (Dua & Graff, 2017) contains 284,556 clean instances with 41 attributes. ... The CelebA face attributes dataset (Liu et al., 2015) contains over 200,000 face images...
Dataset Splits	No	The paper mentions using "ten different dataset splits" in Section 6.3 but does not provide specific percentages, sample counts, or the methodology used for creating these splits (e.g., 80/10/10, random seed, or specific predefined files).
Hardware Specification	No	The paper does not specify any hardware details such as CPU, GPU, or memory used for running the experiments.
Software Dependencies	No	The paper mentions the use of Python libraries such as `numpy`, `scipy.stats.gaussian_kde`, and `statsmodels.distributions.empirical_distribution.ECDF` in the provided pseudocode, but it does not specify their version numbers or the version of Python used.
Experiment Setup	Yes	We train MLP and REG for 10 epochs and train ADV for 40 epochs. For REG, we set different values of the trade-off hyperparameter $\lambda \in [0, 1]$ to control the accuracy-fairness trade-off and $\lambda \in [10, 180]$ for ADV. We adopt a 4-layer fully-connected network and utilize ReLU (Nair & Hinton, 2010) as the activation function. The objective function is defined as $L_{ce} + \lambda L_{dp}$, where $L_{ce}$ is the cross-entropy loss for the downstream task and $L_{dp}$ is the fairness constraint (Equation (2)).
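
The Pseudocode and Software Dependencies rows refer to Python code for ABPC and ABCC built on `numpy` and `scipy.stats.gaussian_kde`. The following is a minimal sketch of what such metrics might look like, not the authors' Algorithm 1 or Algorithm 2. It assumes that ABPC is the area between the two groups' kernel-density estimates of the predictions, that ABCC is the area between their empirical CDFs, that predictions lie in [0, 1], and that the sensitive attribute `s` is binary (0/1). The function names, the `n_grid` parameter, and the NumPy-based ECDF (used here in place of the `statsmodels` ECDF named in the table) are illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde


def _trapezoid(y, x):
    """Trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def _ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the points `x` (right-continuous)."""
    sample = np.sort(sample)
    return np.searchsorted(sample, x, side="right") / sample.size


def abpc(y_pred, s, n_grid=10_000):
    """Assumed definition: area between the KDE-estimated densities of the two groups."""
    y_pred = np.asarray(y_pred, dtype=float)
    s = np.asarray(s)
    x = np.linspace(0.0, 1.0, n_grid)      # predictions assumed to lie in [0, 1]
    f0 = gaussian_kde(y_pred[s == 0])(x)   # density of predictions for group 0
    f1 = gaussian_kde(y_pred[s == 1])(x)   # density of predictions for group 1
    return _trapezoid(np.abs(f0 - f1), x)


def abcc(y_pred, s, n_grid=10_000):
    """Assumed definition: area between the empirical CDFs of the two groups."""
    y_pred = np.asarray(y_pred, dtype=float)
    s = np.asarray(s)
    x = np.linspace(0.0, 1.0, n_grid)
    F0 = _ecdf(y_pred[s == 0], x)
    F1 = _ecdf(y_pred[s == 1], x)
    return _trapezoid(np.abs(F0 - F1), x)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.integers(0, 2, size=2000)
    fair = rng.uniform(0.0, 1.0, size=2000)   # same distribution for both groups
    biased = 0.5 * fair + 0.5 * s             # group 1 shifted upward
    print("fair:  ", abpc(fair, s), abcc(fair, s))
    print("biased:", abpc(biased, s), abcc(biased, s))
```

Under these assumptions both quantities are zero when the two groups' prediction distributions coincide and grow as the distributions separate, which is the property a distribution-level demographic parity metric needs.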