Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Characterization and Learning of Causal Graphs from Hard Interventions

Authors: Zihan Zhou, Muhammad Qasim Elahi, Murat Kocaoglu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conducted empirical experiments in Appendix F to compare hard and soft interventions in learning the causal graph with latents. The results verify this observation. A fundamental question is how can we extract as much causal knowledge as possible from a collection of hard interventional datasets. To the best of our knowledge, this problem has been open before this work. We conducted experiments (Appendix F) to compare the size of I-Markov equivalence class under hard and soft interventions to show that hard interventions on average provide more information about the causal graph.
Researcher Affiliation	Academia	Zihan Zhou* Department of Computer Science Johns Hopkins University EMAIL Muhammad Qasim Elahi* School of Electrical and Computer Engineering Purdue University EMAIL Murat Kocaoglu Department of Computer Science Johns Hopkins University EMAIL
Pseudocode	Yes	Algorithm 1 Main Causal Discovery Algorithm Algorithm 2 Algorithm for Creating F Nodes Algorithm 3 Algorithm for Finding Separating Set
Open Source Code	Yes	Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We discuss this in Appendix F. The instructions are provided in the Readme file.
Open Datasets	No	In this experiment, we compare the I-MEC size under hard and soft interventions. For a given number of observable nodes n, we create an arbitrary ADMG by first constructing a DAG and then adding bidirected edges to it. Then, we enumerate all ADMGs of the same size and check if the ADMG is in the I-MEC. For hard interventions, we construct the I-augmented MAGs according to the steps in Definition 5.1, and then check if Theorem 4.7 holds. For soft interventions, we refer to the construction in Definition 4 and criteria in Theorem 2 in Kocaoglu et al. [2019]. We count the number of ADMGs that are in the I-MEC and take average over 50 random ADMGs and compute the standard error. The results are shown in Table 1.
Dataset Splits	No	The paper describes generating synthetic data for experiments and evaluating equivalence classes of graphs. It mentions
Hardware Specification	Yes	All the experiments are run on an NVIDIA GeForce RTX 3090 graphics card.
Software Dependencies	No	The paper states: "We use the Python implementation of GIES by Olga Kolotuhina and Juan L. Gamella [Gamella, 2025]. The IGSP implementation is from the causaldag package [Squires, 2018]." These are the specific tools used, but the text gives no version numbers for them, nor does it list any other key software components with versions.
Experiment Setup	Yes	We choose ε = 0.01 and set exp(−2Mε^2) = 0.01, which gives M = 23025. For each setting, we randomly sample 50 ground truth ADMGs and then take the average.
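
The sample size quoted in the Experiment Setup row can be checked with a few lines of arithmetic. The sketch below solves exp(−2Mε^2) = δ for M with ε = 0.01 and δ = 0.01. The function name and the symbol δ are illustrative choices and do not come from the paper. The expression has the form of a Hoeffding-style tail bound, but that reading is an assumption, since the excerpt does not say what the bound controls.

```python
import math


def samples_for_tail_bound(epsilon: float, delta: float) -> float:
    """Solve exp(-2 * M * epsilon**2) = delta for M (real-valued).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    return math.log(1.0 / delta) / (2.0 * epsilon ** 2)


epsilon = 0.01  # tolerance reported in the Experiment Setup row
delta = 0.01    # target value of exp(-2 * M * epsilon**2)

m_exact = samples_for_tail_bound(epsilon, delta)
m_floor = math.floor(m_exact)  # rounding down
m_ceil = math.ceil(m_exact)    # rounding up

# Value of the bound when the rounded-down sample size is plugged back in.
bound_at_floor = math.exp(-2.0 * m_floor * epsilon ** 2)

print(f"exact M = {m_exact:.2f}, floor = {m_floor}, ceil = {m_ceil}")
print(f"bound at floor = {bound_at_floor:.6f}")
```

The real-valued solution is roughly 23025.85, so the reported M = 23025 corresponds to rounding down. Plugging 23025 back in gives a value marginally above 0.01; rounding up to 23026 would keep it at or below 0.01.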