Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Decoding Causal Structure: End-to-End Mediation Pathways Inference

Authors: Yulong Li, Xiwei Liu, Feilong Tang, Ming Hu, Jionglong Su, Zongyuan Ge, Imran Razzak, Eran Segal

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments. We evaluate SIGMA across two key tasks: (1) causal structure discovery and (2) causal mediation analysis, covering both synthetic datasets with known ground-truth effects and a real-world medical cohort from the HPP. For structure discovery, we assess the accuracy of learned causal graphs in various settings with different graph topologies, nonlinearity levels, and data types. For mediation analysis, we test SIGMA's ability to estimate direct and indirect effects of identified mediation pathways. The synthetic data allows controlled benchmarking, while the HPP cohort provides a complex, high-dimensional setting with clinically relevant mediation structures.
Researcher Affiliation	Academia	1 Mohamed bin Zayed University of Artificial Intelligence; 2 Weizmann Institute of Science; 3 Xi'an Jiaotong-Liverpool University; 4 Monash University
Pseudocode	Yes	Algorithm 1 VAE-based Missing Value Imputation for Mixed-Type Data
Open Source Code	No	Justification: We have thoroughly described the generation methods and parameters for the synthetic data in Appendix, ensuring the reproducibility of these experiments. For the HPP dataset, we have provided a link to its official knowledge base (https://knowledgebase.pheno.ai/) for reference.
Open Datasets	Yes	Human Phenotype Project (HPP)¹. We conduct our real-world evaluation using the HPP dataset, a deeply phenotyped cohort comprising over 6,000 individuals with multi-night home sleep apnea testing (HSAT) data across 16,000+ nights. [...] ¹ https://knowledgebase.pheno.ai/ [...] Synthetic Data. To evaluate SIGMA's ability to identify mediation pathways under controlled yet realistic conditions, we generate synthetic datasets that mirror key statistical and structural characteristics of real-world data (Figure 1(d)). [...] See Appendix E.1 for detailed generation procedures and E.2 for embedded mediation pathway structures.
Dataset Splits	Yes	Effect estimation employs a 5-fold cross-fitting strategy, utilizing EIF for paths of length 3 and plug-in estimators for longer paths.
Hardware Specification	No	Justification: We discuss in Appendix.
Software Dependencies	No	PRV features were derived from the peripheral arterial tonometry signal using the NeuroKit2 library [42],
Experiment Setup	Yes	In the structure discovery phase, we configure the Flow-SEM model with a two-layer MLP (hidden_dim_multiplier=2) as the conditional distribution model. The model is trained using the Adam optimizer with a learning rate of 0.001, an initial DAG constraint penalty coefficient λ = 0.1, which increases by a factor of 10 when the acyclicity constraint is violated (h_threshold=10^-8), up to a maximum of 10^5. Sparsity is controlled through L1 regularization (α = 0.01), with a maximum gradient norm limit of 1.0. The training iterates for 1000 epochs, with early stopping if no improvement occurs for 100 consecutive epochs. In the posterior sampling phase, we extract 1000 DAG samples from the learned structure. During sampling, each node retains only the top 3 highest probability incoming edges to control sparseness, and a sigmoid function converts the weight matrix to edge existence probabilities. Graph structures are randomly sampled according to a Bernoulli distribution, validated for acyclicity through topological sorting, and cyclic graphs are discarded. For path identification, we set the frequency threshold to 5%, meaning paths appearing with frequency above this threshold in the DAG sample set are identified as stable paths. The maximum path length is limited to 10, and multi-processor parallel computing is employed to accelerate the identification process. The mechanism modeling phase implements 5-fold cross-validation. Each model uses uniform parameters: 50 iterations, batch size of 128, learning rate of 10^-4, and hidden layer dimension of 128. For continuous treatment variables, intervention levels are set at the mean (µ) and the mean plus one standard deviation (µ + σ), while binary variables use intervention levels {0,1}.
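
Illustrative sketch (not from the paper): the posterior sampling and path identification steps quoted in the Experiment Setup row could look roughly like the Python below. This is not the authors' implementation. All function names are hypothetical, and several details are assumptions: weights[i][j] is taken to score the edge i -> j, path length is counted in nodes with a minimum of 3 (treatment, mediator, outcome), and "above the threshold" is read as a strict inequality.

```python
import math
import random
from collections import Counter


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def edge_probabilities(weights, top_k=3):
    # Assumption: weights[i][j] scores the edge i -> j.
    # Each node j keeps only its top_k highest-scoring incoming edges.
    n = len(weights)
    probs = [[0.0] * n for _ in range(n)]
    for j in range(n):
        parents = sorted((i for i in range(n) if i != j),
                         key=lambda i: weights[i][j], reverse=True)
        for i in parents[:top_k]:
            probs[i][j] = sigmoid(weights[i][j])
    return probs


def is_acyclic(adj):
    # Kahn's algorithm: a full topological order exists iff there is no cycle.
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    stack = [j for j in range(n) if indeg[j] == 0]
    visited = 0
    while stack:
        u = stack.pop()
        visited += 1
        for v in range(n):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
    return visited == n


def sample_dags(probs, n_samples, rng, max_tries=100000):
    # Bernoulli-sample each edge; discard graphs that contain a cycle.
    n = len(probs)
    dags, tries = [], 0
    while len(dags) < n_samples and tries < max_tries:
        tries += 1
        adj = [[1 if rng.random() < probs[i][j] else 0 for j in range(n)]
               for i in range(n)]
        if is_acyclic(adj):
            dags.append(adj)
    return dags


def simple_paths(adj, max_nodes=10):
    # Enumerate directed paths with between 3 and max_nodes nodes.
    n, found = len(adj), []

    def extend(path):
        if len(path) >= 3:
            found.append(tuple(path))
        if len(path) == max_nodes:
            return
        for v in range(n):
            if adj[path[-1]][v] and v not in path:
                path.append(v)
                extend(path)
                path.pop()

    for start in range(n):
        extend([start])
    return found


def stable_paths(dags, threshold=0.05, max_nodes=10):
    # A path is "stable" if its frequency across sampled DAGs exceeds threshold.
    counts = Counter()
    for adj in dags:
        counts.update(set(simple_paths(adj, max_nodes)))
    return {p: c / len(dags) for p, c in counts.items()
            if c / len(dags) > threshold}
```

With the quoted settings this would be called as sample_dags(edge_probabilities(W, top_k=3), 1000, random.Random(seed)) followed by stable_paths(dags, threshold=0.05, max_nodes=10), where W and seed are placeholders for the learned weight matrix and a chosen random seed. The sketch runs serially; the multi-processor parallelism mentioned in the row is omitted.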