Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Local Causal Discovery for Structural Evidence of Direct Discrimination

Authors: Jacqueline Maasch, Kyra Gan, Violet Chen, Agni Orfanoudaki, Nil-Jana Akpinar, Fei Wang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use LD3 to analyze causal fairness in two complex decision systems: criminal recidivism prediction and liver transplant allocation. LD3 was more time-efficient and returned more plausible results on real-world data than baselines, which took 46 to 5,870 times longer to execute.
Researcher Affiliation | Collaboration | Jacqueline Maasch (Cornell Tech), Kyra Gan (Cornell Tech), Violet Chen (Stevens Institute of Technology), Agni Orfanoudaki (University of Oxford), Nil-Jana Akpinar (Amazon AWS AI/ML; work done outside Amazon), Fei Wang (Weill Cornell Medicine)
Pseudocode | Yes | Algorithm 1: LD3. Input: exposure X, outcome Y, variable set Z, CI test of choice, significance level α. Output: adjustment set ADE, SDC results. Assumptions: sufficient conditions A1 and A2.
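The quoted pseudocode gives only Algorithm 1's interface, not its steps. A hypothetical Python stub mirroring the stated inputs and outputs (all names here are illustrative, not from the paper; the discovery logic itself is specified in Algorithm 1) could look like:

```python
from typing import Callable, Dict, Sequence, Set, Tuple

def ld3(
    X: str,
    Y: str,
    Z: Sequence[str],
    ci_test: Callable[..., float],  # a CI test of choice, returning a p-value
    alpha: float = 0.01,            # significance level
) -> Tuple[Set[str], Dict[str, bool]]:
    """Interface sketch for LD3 per the paper's Algorithm 1.

    Inputs : exposure X, outcome Y, candidate variable set Z,
             a conditional-independence test, significance level alpha.
    Outputs: an adjustment set (ADE) for the direct effect and
             structural-direct-cause (SDC) test results.
    Assumes the paper's sufficient conditions A1 and A2 hold.
    """
    adjustment_set: Set[str] = set()
    sdc_results: Dict[str, bool] = {}
    # Local discovery steps omitted here; see Algorithm 1 in the paper.
    return adjustment_set, sdc_results
```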
Open Source Code | Yes | Code on GitHub: https://github.com/jmaasch/LD3
Open Datasets | Yes | We assessed the ability of LD3 to facilitate CFA on the ProPublica COMPAS dataset. (...) All baselines were assessed on the SANGIOVESE benchmark from the bnlearn repository (Scutari 2010) (...). We use the National Standard Transplant Analysis and Research (STAR) dataset (OPTN 2024) for adult patients during 2017-2019 (n = 21,101) and 2020-2022 (n = 22,807).
Dataset Splits | Yes | Ten replicate datasets were sampled at n = [250, 500, 1000]. (...) Estimators used random forest classifiers with a 70%/30% train-test split.
Hardware Specification | Yes | All experiments used an Apple MacBook (M2 Pro chip).
Software Dependencies | No | The paper mentions software components like "double machine learning" (with a citation to Chernozhukov et al. 2018 and a URL to econml) and the "bnlearn R package" (Scutari 2010), but it does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | All constraint-based methods used Fisher-z tests (α = 0.01). (...) Causal discovery used χ² CI tests and WCDE estimation used double machine learning (...). Estimators used random forest classifiers with a 70%/30% train-test split. (...) We used three significance levels for independence testing (α = 0.005, 0.01, 0.05) to assess the stability of results.
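As a concrete illustration of the setup quoted above, here is a minimal Fisher-z conditional independence test (the standard partial-correlation formulation, not the authors' code), evaluated at the paper's three significance levels. The toy data below is illustrative only:

```python
import numpy as np
from scipy import stats

def fisher_z_pvalue(data: np.ndarray, i: int, j: int, cond: tuple = ()) -> float:
    """Two-sided p-value of the Fisher-z test for column i ⟂ column j
    given the columns listed in `cond`."""
    cols = [i, j, *cond]
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.inv(corr)                          # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    n = data.shape[0]
    z = np.arctanh(r) * np.sqrt(n - len(cond) - 3)      # Fisher z-transform
    return 2.0 * stats.norm.sf(abs(z))

# Toy data: y strongly depends on x, so independence should be rejected.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.2 * rng.normal(size=2000)
data = np.column_stack([x, y])

p = fisher_z_pvalue(data, 0, 1)
for alpha in (0.005, 0.01, 0.05):
    print(f"alpha={alpha}: reject independence = {p < alpha}")  # True at each level
```

Sweeping α over {0.005, 0.01, 0.05}, as the paper does, checks whether the discovered structure is stable to the test's sensitivity rather than an artifact of one threshold.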