Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
Authors: MASAYUKI TAKAYAMA, Tadahisa OKUDA, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments in this work have revealed that the results of LLM-KBCI and SCD augmented with LLM-KBCI approach the ground truths, more than the SCD result without prior knowledge. These experiments have also revealed that the SCD result can be further improved if the LLM undergoes SCP. Furthermore, with an unpublished real-world dataset, we have demonstrated that the background knowledge provided by the LLM can improve the SCD on this dataset, even if this dataset has never been included in the training data of the LLM. |
| Researcher Affiliation | Academia | Masayuki Takayama (EMAIL): Data Science and AI Innovation Research Promotion Center, Shiga University; National Institute of Science and Technology Policy (NISTEP). Tadahisa Okuda: Department of Health Data Science, Tokyo Medical University; Graduate School of Medicine Human Health Sciences, Kyoto University. Thong Pham: Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Medicine Human Health Sciences, Kyoto University; Center for Advanced Intelligence Project, RIKEN. Tatsuyoshi Ikenoue: Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Medicine Human Health Sciences, Kyoto University. Shingo Fukuma: Graduate School of Medicine Human Health Sciences, Kyoto University; Department of Epidemiology, Infectious Disease Control and Prevention, Hiroshima University Graduate School of Biomedical and Health Sciences; Data Science and AI Innovation Research Promotion Center, Shiga University. Shohei Shimizu (EMAIL): SANKEN, The University of Osaka; Faculty of Data Science, Shiga University; Center for Advanced Intelligence Project, RIKEN; Graduate School of Medicine Human Health Sciences, Kyoto University; Institute for the Advanced Study of Human Biology, Kyoto University; National Institute of Science and Technology Policy (NISTEP). Akiyoshi Sannai (EMAIL): Department of Physics, Kyoto University; Data Science and AI Innovation Research Promotion Center, Shiga University; Graduate School of Engineering, The University of Tokyo; Center for Advanced Intelligence Project, RIKEN; Research and Development Center for Large Language Models, National Institute of Informatics; National Institute of Science and Technology Policy (NISTEP). |
| Pseudocode | Yes | Algorithm 1: Background knowledge construction with the LLM prompted with the results of the SCD. Input 1: data X with variables {x_1, ..., x_n}; Input 2: SCD method S (one of PC, Exact Search, and DirectLiNGAM); Input 3: bootstrap function B; Input 4: response of the domain expert (LLM) ε_LLM; Input 5: log probability of the response L(ε_LLM); Input 6: prompt function for knowledge generation Q^(1)_ij; Input 7: prompt function for knowledge integration Q^(2)_ij; Input 8: transformation T from the probability matrix to prior knowledge; Input 9: number of times to measure the probability M. Output: result of SCD with prior knowledge Ĝ on X. Compute the SCD result without prior knowledge Ĝ_0 = S(X) and the bootstrap probability matrix P = B(S, X). For i = 1 to n, for j = 1 to n: if i = j, set p_ii = NaN; otherwise build prompt q^(1)_ij = Q^(1)_ij(x_i, x_j, Ĝ_0, P), obtain response a_ij = ε_LLM(q^(1)_ij), build prompt q^(2)_ij = Q^(2)_ij(q^(1)_ij, a_ij), measure L^(m)_ij = L^(m)(ε_LLM(q^(2)_ij) = "yes") for m = 1 to M, and set the mean probability p_ij = (1/M) Σ_{m=1}^{M} exp(L^(m)_ij). Form the probability matrix p = (p_ij), derive the prior knowledge PK = T(p), and return Ĝ = S(X, PK). |
| Open Source Code | Yes | The code used in this work is publicly available: https://github.com/mas-takayama/LLM-and-SCD |
| Open Datasets | Yes | Consequently, we select three benchmark datasets for the experiments, as follows: 1. Auto MPG data (Quinlan, 1993) (five continuous variables), 2. Deutscher Wetterdienst (DWD) climate data (Mooij et al., 2016) (six continuous variables), 3. Sachs protein data (Sachs et al., 2005) (eleven continuous variables). |
| Dataset Splits | No | In our experiments, the number of bootstrap resamplings was fixed to 1000. ... For the experiment, to confirm that GPT-4 can supply SCD with adequate prior knowledge, even if a direct edge of the ground truth is not apparent, we repeated the sampling of 1000 points from the entire dataset, until we obtained a subset on which PC, Exact Search, and DirectLiNGAM cannot discern the causal relationship Age → HbA1c without prior knowledge. |
| Hardware Specification | No | The paper mentions using specific LLMs like "GPT-4-1106-preview developed by OpenAI" and other GPT-series models or open-source LLMs. However, it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) on which the experiments, including running the SCD algorithms or interacting with the LLMs, were performed. |
| Software Dependencies | Yes | To utilize the LLM as the domain expert, we adopted GPT-4-1106-preview developed by OpenAI; the temperature, a hyperparameter for adjusting the probability distribution of the output, was fixed to 0.7. ...which are open in causal-learn (Zheng et al., 2023b) and LiNGAM (Ikeuchi et al., 2023). |
| Experiment Setup | Yes | To utilize the LLM as the domain expert, we adopted GPT-4-1106-preview developed by OpenAI; the temperature, a hyperparameter for adjusting the probability distribution of the output, was fixed to 0.7. ...In our experiments, the number of bootstrap resamplings was fixed to 1000. ...For the decision of a forbidden or forced causal relationship from PKij, we prepare the probability criterion for the forbidden path as α1 and that for the forced path as α2. In our experiments, α1 is fixed at 0.05, and α2 is fixed at 0.95 for common settings. ...Thus, we adopted the mean probability pij of the single-shot measurement M times for the decision of prior knowledge matrix PK, and we set M = 5 in the experiments. |
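The prior-knowledge construction quoted in the Pseudocode and Experiment Setup rows (average M = 5 single-shot "yes" probabilities per ordered variable pair, then threshold at α1 = 0.05 / α2 = 0.95) can be sketched as below. This is a minimal illustration, not the authors' implementation: the GPT-4 call is mocked with random probabilities, and `mock_llm_yes_prob` / `build_prior_knowledge` are hypothetical names.

```python
import numpy as np

def mock_llm_yes_prob(i, j, rng):
    """Stand-in for exp(L(eps_LLM(q2_ij) = "yes")): one sampled
    single-shot probability that the LLM affirms the edge x_i -> x_j.
    The real pipeline queries GPT-4 and reads out token log probabilities."""
    return rng.uniform(0.0, 1.0)

def build_prior_knowledge(n_vars, M=5, alpha1=0.05, alpha2=0.95, seed=0):
    """Sketch of Algorithm 1's probability-matrix step and the
    transformation T: average M measurements per pair (i, j), then map
    p_ij < alpha1 to a forbidden path (-1), p_ij > alpha2 to a forced
    path (+1), and everything else to no constraint (0)."""
    rng = np.random.default_rng(seed)
    p = np.full((n_vars, n_vars), np.nan)  # p_ii stays NaN, as in Algorithm 1
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j:
                continue
            shots = [mock_llm_yes_prob(i, j, rng) for _ in range(M)]
            p[i, j] = float(np.mean(shots))  # mean probability p_ij
    # Transformation T from the probability matrix to prior knowledge PK
    PK = np.zeros((n_vars, n_vars), dtype=int)
    PK[p < alpha1] = -1  # forbidden causal path
    PK[p > alpha2] = 1   # forced causal path
    return p, PK

p, PK = build_prior_knowledge(n_vars=5)
```

In the paper, `PK` is then passed back to the chosen SCD method (PC, Exact Search, or DirectLiNGAM) as background knowledge; here the mock only reproduces the thresholding logic, so most entries of `PK` remain 0 under the uniform mock.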