Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Differentiable Constraint-Based Causal Discovery

Authors: Jincheng Zhou, Mengbo Wang, Anqi He, Yumeng Zhou, Hessam Olya, Murat Kocaoglu, Bruno Ribeiro

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations demonstrate the robust performance of our approach in low-sample regimes, surpassing traditional constraint-based and score-based baselines on a real-world dataset. Code and data of the proposed method are publicly available at https://github.com/PurdueMINDS/DAGPA. [...] This section presents an empirical evaluation of DAGPA's ability to discover DAGs whose d-separation statements are consistent with those derived from the underlying causal structure that generated the data.
Researcher Affiliation	Collaboration	Jincheng Zhou¹, Mengbo Wang¹, Anqi He², Yumeng Zhou², Hessam Olya², Murat Kocaoglu³, Bruno Ribeiro¹; ¹Purdue University, ²Ford Motor Company, ³Johns Hopkins University
Pseudocode	Yes	Algorithm 3 DAGPA. Require: Data D, initial parameter θ_0, number of steps T, number of best DAGs K, step size β. 1: for t = 0 : T − 1 do 2: W_t ← σ(θ_t) 3: for (x, y) ∈ [d]^2 do ▷ Can be done in parallel 4: compute S^(0)_{W_t} and C^(0)_{W_t} as in Definition 3.6 5: for (x, y, z) ∈ [d]^3 do ▷ Can be done in parallel 6: compute S^(1)_{W_t} and C^(1)_{W_t} as in Definition 3.6 7: compute L_TP-0, L_TP-1, L_TN-0, L_TN-1, L_DAG as in Definition 4.1 8: U(θ_t) ← L_TP-0 + L_TP-1 + L_TN-0 + L_TN-1 + L_DAG 9: compute θ^(PC)_t via PCGrad [43] (Algorithm 1) 10: compute θ_{t+1} via DLP [50] (Algorithm 2) 11: compute A_{t+1} by converting θ_{t+1} to a discrete DAG (Appendix C.3) 12: compute TPTN-Ratio(A_{t+1}, D) 13: return K DAGs from {A_t}_{t=1}^{T} with the Top-K highest TPTN-Ratio(A_t, D) score
Open Source Code	Yes	Code and data of the proposed method are publicly available at https://github.com/PurdueMINDS/DAGPA. [...] We release our code and data in https://github.com/PurdueMINDS/DAGPA
Open Datasets	Yes	Moreover, on the real-world Sachs dataset [28], DAGPA shows that differentiable d-separation offers accurate modeling of the independence patterns in the data, outperforming baselines in our metrics. [...] For real-world validation, we use the Sachs dataset [28], a benchmark protein signaling network with 11 variables and a known ground-truth causal structure derived from experimental interventions. [...] We also show the results with additional metrics on Sachs [28] dataset and on an additional real-world dataset Lucas [14] in Figure 9.
Dataset Splits	No	For each configuration, we generate 10 datasets. [...] We generated datasets with the same sample sizes n ∈ {100, 1000, 10000, 100000} and graph structures (ER and SF with d ∈ {10, 50} nodes and arc ratios r ∈ {2, 4}) as the binary setting to enable direct comparison across data types.
Hardware Specification	Yes	For the baselines, we ran PC, k-PC, GES and linear version of NOTEARS, DAGMA on a 64-core AMD Epyc 7662 "Rome" processor with 16 CPU cores and 32 GB memory requested. The non-linear version of NOTEARS and DAGMA were run on one A30 with same CPU and memory requirement. Every experiment is completed in 4 hours. For DAGPA, we run all experiments on an AMD GPU cluster, equipped with 32GB MI108 and 64GB MI210 and EPYC 7V13 CPU with 64 cores.
Software Dependencies	No	In practice, we leverage PyTorch GPU tensor library for all such computations, avoiding any explicit for-loops and significantly improving the speed of optimization. [...] We used the causal-learn implementation [53] for PC and GES algorithms.
Experiment Setup	Yes	The most important and sensitive hyperparameter in DAGPA is the DLP sampling step size β (Equation (9)). To this end, we first find values for all other hyperparameters through preliminary experimentations then fix them, and only vary in the step size for the experiments on the synthetic binary dataset and the real-world datasets. Some of the other important hyperparameters and their values: DLP support logit set D: This hyperparameter controls the support logits that the model parameter θ can take during sampling. We use D = [−2.0, 0.0, 2.0]. LogMeanExp temperature α: ... In our experiments, for small graphs (include n = 10 synthetic binary dataset and both Sachs [28] and Lucas [14]) we use s = 3.0, while for large graphs (n = 50 synthetic binary dataset) we use s = 8.0. Finally, for the DLP step size β, for each dataset we choose a different range to run hyperparameter search and choose the best sampled DAGs therein according to the DAG selection score (Appendix C.3).
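
The synthetic benchmark quoted in the Dataset Splits row can be enumerated as a small grid, which may help when checking how many datasets a rerun would need. The sketch below is illustrative only and is not taken from the DAGPA repository. The function names are hypothetical, the expansion of ER and SF to Erdős–Rényi and scale-free graphs is an assumption, and it assumes the "10 datasets per configuration" statement applies to every cell of this grid.

```python
from itertools import product

# Values quoted in the Dataset Splits row of the table above.
SAMPLE_SIZES = (100, 1_000, 10_000, 100_000)  # n
GRAPH_MODELS = ("ER", "SF")                   # assumed: Erdos-Renyi, scale-free
NODE_COUNTS = (10, 50)                        # d
ARC_RATIOS = (2, 4)                           # r
DATASETS_PER_CONFIG = 10                      # "we generate 10 datasets"


def synthetic_configs():
    """Return one dict per (graph model, d, r, n) combination."""
    return [
        {"graph": g, "d": d, "r": r, "n": n}
        for g, d, r, n in product(
            GRAPH_MODELS, NODE_COUNTS, ARC_RATIOS, SAMPLE_SIZES
        )
    ]


def dataset_ids():
    """Pair every configuration with a replicate index (hypothetical scheme)."""
    return [
        (cfg, rep)
        for cfg in synthetic_configs()
        for rep in range(DATASETS_PER_CONFIG)
    ]


if __name__ == "__main__":
    print(len(synthetic_configs()))  # 2 * 2 * 2 * 4 = 32 configurations
    print(len(dataset_ids()))        # 32 * 10 = 320 datasets
```

Under these assumptions the grid has 32 configurations and 320 datasets per data type. The excerpts describe no train/validation/test partition, which is consistent with the "No" result for Dataset Splits.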