Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Systems with Switching Causal Relations: A Meta-Causal Perspective
Authors: Moritz Willig, Tim Tobiasch, Florian Busch, Jonas Seng, Devendra Singh Dhami, Kristian Kersting
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our goal in this experiment is to recover the number of meta-causal states K ∈ [1..4] from data that exists between two variables X, Y that are directly connected by a linear equation with added noise. ... We evaluate our approach over all k ∈ [1..4] by generating 100 different datasets for every particular number of mechanisms. For every dataset we sample 500 data points from each mechanism (x_i^k, y_i^k) = f_k(α_k x_i^k + β_k + l_i), where l_i ∼ L(0, b_k), using the same sampling method as before (c.f. Appendix G). Finally, the algorithm recovers the number of mechanisms. ... Table 1 shows the confusion matrices between the actual number of mechanisms and the predicted number for different values of maximum class imbalances. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Technical University of Darmstadt, Germany 2Hessian Center for AI (hessian.AI), Germany 3Dept. of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands 4Centre for Cognitive Science, Technical University of Darmstadt, Germany 5German Research Center for AI (DFKI), Germany |
| Pseudocode | Yes | We provide the pseudo code for our method in Algorithm 1 in the Appendix. |
| Open Source Code | Yes | Code is made available at https://github.com/MoritzWillig/metaCausalModels. |
| Open Datasets | No | Our goal in this experiment is to recover the number of meta-causal states K ∈ [1..4] from data that exists between two variables X, Y that are directly connected by a linear equation with added noise. ... For every dataset we sample 500 data points from each mechanism (x_i^k, y_i^k) = f_k(α_k x_i^k + β_k + l_i), where l_i ∼ L(0, b_k), using the same sampling method as before (c.f. Appendix G). |
| Dataset Splits | No | The paper describes generating synthetic data points for experiments but does not specify how these generated data points were divided into training, validation, or test sets in a typical machine learning context. The goal is to recover mechanisms, not to evaluate predictive performance on pre-split datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using L1-regression and the Anderson-Darling test, implying the use of statistical or machine learning libraries, but does not provide specific version numbers for any software components (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | We assume that each meta-causal state gives rise to a different linear equation f_k := α_k X + β_k + N, k ∈ ℕ, where α_k, β_k are the slope and intercept of the respective mechanism and N is a zero-centered, symmetric, and quasiconvex noise distribution. ... We perform 5 EM steps for setups with k = 1 and k = 2 mechanisms, and increase to 10 EM iterations for 3 and 4 mechanisms. ... if the slope and intercept of the true and predicted values do not differ by more than an absolute value of 0.2. ... The slopes of the linear equations are uniformly sampled in α ∈ [0.2, 5] and the intercepts are in the range β ∈ [−5, 5]. We add Laplacian noise L(x \| µ, b) = (1/2b) exp(−\|x − µ\| / b) with µ = 0 and b ∈ [0.1, 4.0]. X values are uniformly sampled in the range [−5, 5] and y_i = α x_i + β + l_i with l_i ∼ L(0, b). The average number of samples per class is set to 500. |
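The data-generation procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration assuming NumPy; the function names and the fixed sample count of 500 per mechanism are illustrative, not the authors' released code (which is linked in the Open Source Code row).

```python
import numpy as np

def sample_mechanism(rng):
    """Sample one linear mechanism's parameters per the quoted setup."""
    alpha = rng.uniform(0.2, 5.0)   # slope, alpha in [0.2, 5]
    beta = rng.uniform(-5.0, 5.0)   # intercept, beta in [-5, 5]
    b = rng.uniform(0.1, 4.0)       # Laplace noise scale, b in [0.1, 4.0]
    return alpha, beta, b

def sample_data(alpha, beta, b, n=500, rng=None):
    """Draw n (x, y) pairs with y = alpha*x + beta + Laplace(0, b) noise."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.uniform(-5.0, 5.0, size=n)          # x uniform in [-5, 5]
    y = alpha * x + beta + rng.laplace(0.0, b, size=n)
    return x, y

rng = np.random.default_rng(0)
alpha, beta, b = sample_mechanism(rng)
x, y = sample_data(alpha, beta, b, n=500, rng=rng)
```

Repeating this per mechanism k (with independently sampled α_k, β_k, b_k) yields one synthetic dataset of the kind used to test whether the number of mechanisms can be recovered.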