Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Clustering and Structural Robustness in Causal Diagrams

Authors: Santtu Tikka, Jouni Helske, Juha Karvanen

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide a sound and complete algorithm for finding all transit clusters in a given graph and demonstrate how clustering can simplify the identification of causal effects. We also study the inverse problem, where one starts with a clustered graph and looks for extended graphs where the identifiability properties of causal effects remain unchanged. We show that this kind of structural robustness is closely related to transit clusters. Keywords: causal inference, graph theory, algorithm, identifiability, directed acyclic graph. For example, in case of the Sangiovese graph, the causaleffect package returns (1) in 0.1 seconds with simplification option disabled and 132 seconds with simplification enabled (which in this case does not lead to simpler equation) on a standard laptop. On the other hand, running the clustering algorithm and subsequent identification (which returns (2)) takes only 0.5 seconds.
Researcher Affiliation	Academia	Santtu Tikka EMAIL Jouni Helske EMAIL Juha Karvanen EMAIL Department of Mathematics and Statistics P.O.Box 35 (Ma D) FI-40014 University of Jyvaskyla, Finland
Pseudocode	Yes	Algorithm 1 An algorithm for finding all (restricted) transit components of a DAG. The inputs are a DAG G = (V, E) and a restriction set R whose members are the only vertices in V that are allowed to belong to any transit component. Returns the set of all restricted transit components in G with respect to R. 1: function Find Tr Comp(G, R) 2: A 3: VCh {Ch (v) R \| v V } 4: VPa {Pa (v) R \| v V } 5: for all Vi VCh do 6: for all Vj VPa do 7: if Vj = then Z Vi An G(Vj) else Z Vi 8: if Vi = then W Vj De G(Vi) else W Vj 9: if ( zk Z s.t. Pa G(zk) = ) ( wk W s.t. Ch G(wk) = ) then 10: continue 11: if Z = W = then 12: A Co G[Z,W ](Z W) 13: if A R then 14: for all Ak C(G[A]) do 15: Zk Z Ak 16: Wk W Ak 17: ZPa T z Zk Pa G(z) \ Ak 18: WCh T w Wk Ch G(w) \ Ak 19: if ZPa = Pa (Zk) \ Ak WCh = Ch (Wk) \ Ak then 20: A A {Ak} 21: return A Algorithm 2 An algorithm for finding all (restricted) transit clusters of a DAG. The inputs are a DAG G = (V, E) and the set of all transit components TG\|R with respect to R V . Returns the set of all restricted transit clusters in G with respect to R. The subroutine Expand Clust is used to recursively construct the clusters. 1: function Find Tr Clust(G, TG\|R) 2: A TG\|R 3: B TG\|R 4: for all T TG\|R do 5: B B \ {T} 6: A A Expand Clust(T, A, B, G) 7: return A 1: function Expand Clust(T, A, B, G) 2: B B 3: for all S B do 4: B B \ {S} 5: if S T A and Corollary 10 holds for S and T then 6: A A {S T} Expand Clust(S T, A, B , G) 7: return A
Open Source Code	Yes	Code for the clustering algorithms and examples are available in a Git Hub repository: https://github.com/santikka/transit_cluster.
Open Datasets	Yes	We consider a graph related to the Sangiovese grapes studied earlier as a conditional linear Gaussian network by Magrini et al. (2017)... The example is based on the National Health and Nutrition Examination Survey (NHANES, https://wwwn.cdc.gov/nchs/nhanes/) 2015 2016 that is an observational study on the health and nutritional status of adults and children in the United States.
Dataset Splits	No	The paper uses existing datasets for illustrative purposes (Sangiovese grapes, NHANES) but does not describe any specific training, validation, or test splits for these datasets. The focus is on the theoretical properties of clustering algorithms and their application to causal effect identification rather than machine learning model training.
Hardware Specification	Yes	For example, in case of the Sangiovese graph, the causaleffect package returns (1) in 0.1 seconds with simplification option disabled and 132 seconds with simplification enabled (which in this case does not lead to simpler equation) on a standard laptop.
Software Dependencies	Yes	The R (R Core Team, 2022) package causaleffect (Tikka and Karvanen, 2017a) implements an automatic simplification algorithm for this task
Experiment Setup	No	The paper describes comparing the runtime of algorithms with and without a 'simplification option enabled' and mentions specific runtimes (0.1, 132, 0.5 seconds). However, these are performance benchmarks for an identification algorithm rather than detailing hyperparameters, training configurations, or model initialization for a machine learning experiment.