Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Not All Causal Inference is the Same

Authors: Matej Zečević, Devendra Singh Dhami, Kristian Kersting

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate the newly-introduced LTNCM (Def. 5) specifically. We first sanity check the model by checking for causal effect and general density estimation. Then we conduct two experiments regarding tractability of causal inference. More specifically, we answer the following questions: Q1. To which degree are causal effects being captured on qualitatively different structures? Q2. How is the estimation quality for interventional distribution modelling? Q3. How does time complexity scale when increasing the SCM's size, that is, number of modelling units (NU)? Q4. How does time complexity scale when increasing the size of each unit per SCM structural equation (SU)? TL;DR. The questions Q1-2 are both answered in favor of LTNCM (see Fig. 5 and Tab. 2), that is, both causal effect estimation as well as general density estimation are competitive with standard NCM, while Q3-4 confirm our previous discussions in terms of general inference being intractable and mechanism inference being linear for the LTNCM (see Fig. 6).
Researcher Affiliation | Academia | Matej Zečević (EMAIL): Computer Science Department, TU Darmstadt, Germany. Devendra Singh Dhami (EMAIL): Computer Science Department, TU Darmstadt, Germany; Hessian Center for AI (hessian.AI), Germany. Kristian Kersting (EMAIL): Computer Science Department, TU Darmstadt, Germany; Centre for Cognitive Science, TU Darmstadt, Germany; Hessian Center for AI (hessian.AI), Germany; German Research Center for Artificial Intelligence (DFKI), Germany.
Pseudocode | No | The paper describes models and proofs using mathematical notation and prose, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | We make our code repository for reproducing the empirical part with the LTNCM and visualizations publicly available at: https://github.com/zecevic-matej/Not-All-Causal-Inference-is-the-Same
Open Datasets | No | Since we are interested in qualitative behavior in light of the theoretical results established previously, we consider custom SCM simulations. For details regarding our synthetic data sets (that is, the used SCM families), the overall protocol and hyperparameters we point to appendix A.1. The paper describes how the synthetic data was generated but does not provide access to a public dataset.
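The SCM families themselves are specified only in the paper's appendix A.1 and are not reproduced here. As a rough illustration of the kind of synthetic generation described, here is a minimal sketch of sampling observational data from a toy binary chain SCM; the structure, coefficients, and function name are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain_scm(n):
    """Toy binary chain SCM X -> Y (hypothetical stand-in for the
    paper's SCM families, which are specified in appendix A.1)."""
    u_x = rng.random(n)  # exogenous noise for X
    u_y = rng.random(n)  # exogenous noise for Y
    x = (u_x < 0.5).astype(int)              # X := f_X(U_X)
    y = ((0.2 + 0.6 * x) > u_y).astype(int)  # Y := f_Y(X, U_Y)
    return x, y
```

Training data of the kind quoted above ("10k data points sampled from the observational distribution") would then be one call such as `sample_chain_scm(10_000)`.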
Dataset Splits | No | We used learned models for four different random seeds and for each parameterization of any given underlying SCM. For the NCM's neural networks, we deploy simple MLPs with three hidden layers of 10 neurons each, and the input-/output-layers are |Pa_i| + 1 and 1 respectively. For the LTNCM's SPNs, we deploy simple two-layer SPNs (following the layerwise principle introduced in Peharz et al. (2020a)) where the first layer consists of leaf nodes, the second layer of product nodes, the third layer of sum nodes and a final product node aggregation. The number of channels is set to 30. We use ADAM (Kingma & Ba, 2014) optimization, and train up to three passes of 10k data points sampled from the observational distribution of any SCM. The paper describes data generation and training passes, but not explicit train/test/validation splits for a fixed dataset.
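The quoted MLP architecture can be sketched in plain NumPy. The excerpt only fixes the layer sizes (input of size |Pa_i| + 1, i.e. parents plus one noise term; three hidden layers of 10 units; one output), so the nonlinearity, initialization, and function names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(n_parents):
    """One NCM mechanism: input layer of size |Pa_i| + 1 (parents plus a
    noise term), three hidden layers of 10 units each, one output unit."""
    sizes = [n_parents + 1, 10, 10, 10, 1]
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    # tanh on the hidden layers is an assumption; the excerpt names none.
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x
```

For a variable with two parents, `make_mlp(2)` yields four weight matrices, and `forward(mlp, batch)` maps a `(batch, 3)` input array to a `(batch, 1)` output.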
Hardware Specification | Yes | All experiments are performed on a MacBook Pro (13-inch, 2020, four Thunderbolt 3 ports) laptop with a 2.3 GHz Quad-Core Intel Core i7 CPU and 16 GB of 3733 MHz LPDDR4X RAM, on time scales ranging from a few seconds up to an hour for the longest experiment setting.
Software Dependencies | No | We use ADAM (Kingma & Ba, 2014) optimization, and train up to three passes of 10k data points sampled from the observational distribution of any SCM. The paper mentions an optimization algorithm (ADAM) but does not list specific software libraries or frameworks with version numbers used for implementation.
Experiment Setup | Yes | For the NCM's neural networks, we deploy simple MLPs with three hidden layers of 10 neurons each, and the input-/output-layers are |Pa_i| + 1 and 1 respectively. For the LTNCM's SPNs, we deploy simple two-layer SPNs (following the layerwise principle introduced in Peharz et al. (2020a)) where the first layer consists of leaf nodes, the second layer of product nodes, the third layer of sum nodes and a final product node aggregation. The number of channels is set to 30. We use ADAM (Kingma & Ba, 2014) optimization, and train up to three passes of 10k data points sampled from the observational distribution of any SCM. For causal effect estimation, we focus on the average treatment effect given by ATE(T, E) := E[E | do(T = 1)] - E[E | do(T = 0)], which for the binary setting reduces to the probabilistic difference p(Y = 1 | do(X = 1)) - p(Y = 1 | do(X = 0)) = ATE(T, E). For measuring density estimation quality, we resort to the Jensen-Shannon divergence (JSD) with base 2, which is bounded in [0, 1], where 0 indicates identical probability density functions, i.e., an optimal match in terms of JSD.
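The two evaluation quantities quoted above can be written out directly. This is a minimal sketch with illustrative function names (not from the paper's repository): the ATE is estimated as a difference of sample means under the two interventions, and the base-2 JSD is computed for discrete distributions:

```python
import numpy as np

def ate(y_do_t1, y_do_t0):
    """ATE(T, E) = E[E | do(T = 1)] - E[E | do(T = 0)], estimated from
    samples drawn under the two interventions."""
    return np.mean(y_do_t1) - np.mean(y_do_t0)

def jsd_base2(p, q, eps=1e-12):
    """Jensen-Shannon divergence with log base 2: bounded in [0, 1],
    equal to 0 exactly when the two distributions coincide."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize to probability vectors
    m = 0.5 * (p + q)                 # mixture distribution
    kl = lambda a, b: float(np.sum(a * (np.log2(a + eps) - np.log2(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For binary outcomes, `ate` over samples of Y under do(X = 1) and do(X = 0) is exactly the probabilistic difference p(Y = 1 | do(X = 1)) - p(Y = 1 | do(X = 0)) from the quote.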