Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Local Causal Discovery for Structural Evidence of Direct Discrimination

Authors: Jacqueline Maasch, Kyra Gan, Violet Chen, Agni Orfanoudaki, Nil-Jana Akpinar, Fei Wang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use LD3 to analyze causal fairness in two complex decision systems: criminal recidivism prediction and liver transplant allocation. LD3 was more time-efficient and returned more plausible results on real-world data than baselines, which took 46 to 5,870 times longer to execute.
Researcher Affiliation | Collaboration | Jacqueline Maasch (Cornell Tech), Kyra Gan (Cornell Tech), Violet Chen (Stevens Institute of Technology), Agni Orfanoudaki (University of Oxford), Nil-Jana Akpinar (Amazon AWS AI/ML; work done outside Amazon), Fei Wang (Weill Cornell Medicine)
Pseudocode | Yes | Algorithm 1: LD3. Input: exposure X, outcome Y, variable set Z, CI test of choice, significance level α. Output: adjustment set ADE, SDC results. Assumptions: sufficient conditions A1 and A2.
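The quoted pseudocode gives only Algorithm 1's interface, not its steps. A hypothetical Python stub mirroring the stated inputs and outputs (all names here are illustrative, not from the paper; the discovery logic itself is specified in Algorithm 1) could look like:

```python
from typing import Callable, Dict, Sequence, Set, Tuple

def ld3(
    X: str,
    Y: str,
    Z: Sequence[str],
    ci_test: Callable[..., float],  # a CI test of choice, returning a p-value
    alpha: float = 0.01,            # significance level
) -> Tuple[Set[str], Dict[str, bool]]:
    """Interface sketch for LD3 per the paper's Algorithm 1.

    Inputs : exposure X, outcome Y, candidate variable set Z,
             a conditional-independence test, significance level alpha.
    Outputs: an adjustment set (ADE) for the direct effect and
             structural-direct-cause (SDC) test results.
    Assumes the paper's sufficient conditions A1 and A2 hold.
    """
    adjustment_set: Set[str] = set()
    sdc_results: Dict[str, bool] = {}
    # Local discovery steps omitted here; see Algorithm 1 in the paper.
    return adjustment_set, sdc_results
```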
Open Source Code | Yes | Code on GitHub: https://github.com/jmaasch/LD3
Open Datasets | Yes | We assessed the ability of LD3 to facilitate CFA on the ProPublica COMPAS dataset. (...) All baselines were assessed on the SANGIOVESE benchmark from the bnlearn repository (Scutari 2010) (...). We use the National Standard Transplant Analysis and Research (STAR) dataset (OPTN 2024) for adult patients during 2017-2019 (n = 21,101) and 2020-2022 (n = 22,807).
Dataset Splits | Yes | Ten replicate datasets were sampled at n = [250, 500, 1000]. (...) Estimators used random forest classifiers with a 70%/30% train-test split.
Hardware Specification | Yes | All experiments used an Apple MacBook (M2 Pro chip).
Software Dependencies | No | The paper mentions software components like "double machine learning" (with a citation to Chernozhukov et al. 2018 and a URL to econml) and the "bnlearn R package" (Scutari 2010), but it does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | All constraint-based methods used Fisher-z tests (α = 0.01). (...) Causal discovery used χ² CI tests and WCDE estimation used double machine learning (...). Estimators used random forest classifiers with a 70%/30% train-test split. (...) We used three significance levels for independence testing (α = 0.005, 0.01, 0.05) to assess the stability of results.
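As a concrete illustration of the setup quoted above, here is a minimal Fisher-z conditional independence test (the standard partial-correlation formulation, not the authors' code), evaluated at the paper's three significance levels. The toy data below is illustrative only:

```python
import numpy as np
from scipy import stats

def fisher_z_pvalue(data: np.ndarray, i: int, j: int, cond: tuple = ()) -> float:
    """Two-sided p-value of the Fisher-z test for column i ⟂ column j
    given the columns listed in `cond`."""
    cols = [i, j, *cond]
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    n = data.shape[0]
    z = np.arctanh(r) * np.sqrt(n - len(cond) - 3)      # Fisher z-transform
    return 2.0 * stats.norm.sf(abs(z))

# Toy data: y strongly depends on x, so independence should be rejected.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.2 * rng.normal(size=2000)
data = np.column_stack([x, y])

p = fisher_z_pvalue(data, 0, 1)
for alpha in (0.005, 0.01, 0.05):
    print(f"alpha={alpha}: reject independence = {p < alpha}")  # True at each level
```

Sweeping α over {0.005, 0.01, 0.05}, as the paper does, checks whether the discovered structure is stable to the test's sensitivity rather than an artifact of one threshold.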