Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
Authors: MASAYUKI TAKAYAMA, Tadahisa OKUDA, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments in this work have revealed that the results of LLM-KBCI and SCD augmented with LLM-KBCI approach the ground truths, more than the SCD result without prior knowledge. These experiments have also revealed that the SCD result can be further improved if the LLM undergoes SCP. Furthermore, with an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve the SCD on this dataset, even if this dataset has never been included in the training data of the LLM. |
| Researcher Affiliation | Academia | Masayuki Takayama (EMAIL): Data Science and AI Innovation Research Promotion Center, Shiga University; National Institute of Science and Technology Policy (NISTEP). Tadahisa Okuda: Department of Health Data Science, Tokyo Medical University; Graduate School of Medicine Human Health Sciences, Kyoto University. Thong Pham: Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Medicine Human Health Sciences, Kyoto University; Center for Advanced Intelligence Project, RIKEN. Tatsuyoshi Ikenoue: Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Medicine Human Health Sciences, Kyoto University. Shingo Fukuma: Graduate School of Medicine Human Health Sciences, Kyoto University; Department of Epidemiology, Infectious Disease Control and Prevention, Hiroshima University Graduate School of Biomedical and Health Sciences; Data Science and AI Innovation Research Promotion Center, Shiga University. Shohei Shimizu (EMAIL): SANKEN, The University of Osaka; Faculty of Data Science, Shiga University; Center for Advanced Intelligence Project, RIKEN; Graduate School of Medicine Human Health Sciences, Kyoto University; Institute for the Advanced Study of Human Biology, Kyoto University; National Institute of Science and Technology Policy (NISTEP). Akiyoshi Sannai (EMAIL): Department of Physics, Kyoto University; Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Engineering, The University of Tokyo; Center for Advanced Intelligence Project, RIKEN; Research and Development Center for Large Language Models, National Institute of Informatics; National Institute of Science and Technology Policy (NISTEP). |
| Pseudocode | Yes | Algorithm 1: Background knowledge construction with the LLM prompted with the results of the SCD. Input 1: data X with variables {x_1, ..., x_n}; Input 2: SCD method S (one of PC, Exact Search, and DirectLiNGAM); Input 3: bootstrap function B; Input 4: response of the domain expert (LLM) ε_LLM; Input 5: log probability of the response L(ε_LLM); Input 6: prompt function for knowledge generation Q^(1)_ij; Input 7: prompt function for knowledge integration Q^(2)_ij; Input 8: transformation T from the probability matrix to prior knowledge; Input 9: number of times to measure the probability M. Output: result of SCD with prior knowledge Ĝ on X. Compute the SCD result without prior knowledge Ĝ_0 = S(X) and the bootstrap probability matrix P = B(S, X). For i = 1 to n, for j = 1 to n: if i = j, set p_ii = NaN; otherwise build prompt q^(1)_ij = Q^(1)_ij(x_i, x_j, Ĝ_0, P), obtain response a_ij = ε_LLM(q^(1)_ij), build prompt q^(2)_ij = Q^(2)_ij(q^(1)_ij, a_ij), measure L^(m)_ij = L^(m)(ε_LLM(q^(2)_ij) = "yes") for m = 1 to M, and set the mean probability p_ij = (1/M) Σ_{m=1}^{M} exp(L^(m)_ij). Form the probability matrix p = (p_ij), derive the prior knowledge PK = T(p), and return Ĝ = S(X, PK). |
| Open Source Code | Yes | The code used in this work is publicly available: https://github.com/mas-takayama/LLM-and-SCD |
| Open Datasets | Yes | Consequently, we select three benchmark datasets for the experiments, as follows: 1. Auto MPG data (Quinlan, 1993) (five continuous variables), 2. Deutscher Wetterdienst (DWD) climate data (Mooij et al., 2016) (six continuous variables), 3. Sachs protein data (Sachs et al., 2005) (eleven continuous variables). |
| Dataset Splits | No | In our experiments, the number of bootstrap resamplings was fixed to 1000. ... For the experiment, to confirm that GPT-4 can supply SCD with adequate prior knowledge, even if a direct edge of the ground truth is not apparent, we repeated the sampling of 1000 points from the entire dataset, until we obtained a subset on which PC, Exact Search, and DirectLiNGAM cannot discern the causal relationship Age → HbA1c without prior knowledge. |
| Hardware Specification | No | The paper mentions using specific LLMs like "GPT-4-1106-preview developed by OpenAI" and other GPT-series models or open-source LLMs. However, it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) on which the experiments, including running the SCD algorithms or interacting with the LLMs, were performed. |
| Software Dependencies | Yes | To utilize the LLM as the domain expert, we adopted GPT-4-1106-preview developed by OpenAI; the temperature, a hyperparameter for adjusting the probability distribution of the output, was fixed to 0.7. ...which are open in causal-learn (Zheng et al., 2023b) and LiNGAM (Ikeuchi et al., 2023). |
| Experiment Setup | Yes | To utilize the LLM as the domain expert, we adopted GPT-4-1106-preview developed by OpenAI; the temperature, a hyperparameter for adjusting the probability distribution of the output, was fixed to 0.7. ...In our experiments, the number of bootstrap resamplings was fixed to 1000. ...For the decision of a forbidden or forced causal relationship from PKij, we prepare the probability criterion for the forbidden path as α1 and that for the forced path as α2. In our experiments, α1 is fixed at 0.05, and α2 is fixed at 0.95 for common settings. ...Thus, we adopted the mean probability pij of the single-shot measurement M times for the decision of prior knowledge matrix PK, and we set M = 5 in the experiments. |
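The prior-knowledge construction quoted in the Pseudocode and Experiment Setup rows (average M = 5 single-shot "yes" probabilities per ordered variable pair, then threshold at α1 = 0.05 / α2 = 0.95) can be sketched as below. This is a minimal illustration, not the authors' implementation: the GPT-4 call is mocked with random probabilities, and `mock_llm_yes_prob` / `build_prior_knowledge` are hypothetical names.

```python
import numpy as np

def mock_llm_yes_prob(i, j, rng):
    """Stand-in for exp(L(eps_LLM(q2_ij) = "yes")): one sampled
    single-shot probability that the LLM affirms the edge x_i -> x_j.
    The real pipeline queries GPT-4 and reads out token log probabilities."""
    return rng.uniform(0.0, 1.0)

def build_prior_knowledge(n_vars, M=5, alpha1=0.05, alpha2=0.95, seed=0):
    """Sketch of Algorithm 1's probability-matrix step and the
    transformation T: average M measurements per pair (i, j), then map
    p_ij < alpha1 to a forbidden path (-1), p_ij > alpha2 to a forced
    path (+1), and everything else to no constraint (0)."""
    rng = np.random.default_rng(seed)
    p = np.full((n_vars, n_vars), np.nan)  # p_ii stays NaN, as in Algorithm 1
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j:
                continue
            shots = [mock_llm_yes_prob(i, j, rng) for _ in range(M)]
            p[i, j] = float(np.mean(shots))  # mean probability p_ij
    # Transformation T from the probability matrix to prior knowledge PK
    PK = np.zeros((n_vars, n_vars), dtype=int)
    PK[p < alpha1] = -1  # forbidden causal path
    PK[p > alpha2] = 1   # forced causal path
    return p, PK

p, PK = build_prior_knowledge(n_vars=5)
```

In the paper, `PK` is then passed back to the chosen SCD method (PC, Exact Search, or DirectLiNGAM) as background knowledge; here the mock only reproduces the thresholding logic, so most entries of `PK` remain 0 under the uniform mock.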