Improving Causal Discovery By Optimal Bayesian Network Learning
Authors: Ni Y. Lu, Kun Zhang, Changhe Yuan (pp. 8741–8748)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we demonstrate that optimal score-based exhaustive search is remarkably useful for causal discovery: it requires weaker conditions to guarantee asymptotic correctness, and outperforms well-known methods including PC, GES, GSP, and NOTEARS. In order to achieve scalability, we also develop an approximation algorithm for larger systems based on the A* method, which scales up to 60+ variables and obtains better results than existing greedy algorithms such as GES, MMHC, and GSP. Our results illustrate the risk of assuming the faithfulness assumption, the advantages of exhaustive search methods, and the limitations of greedy search methods, and shed light on the computational challenges and techniques in scaling up to larger networks and handling unfaithful data. |
| Researcher Affiliation | Academia | 1 Graduate Center, City University of New York, New York, NY 2 Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 3 Department of Computer Science, Queens College, Queens, NY |
| Pseudocode | No | The paper describes the 'Triplet Method' conceptually but does not include a formal 'Algorithm' or 'Pseudocode' block. |
| Open Source Code | No | The paper mentions using third-party open-source libraries and R/Python packages (e.g., 'open-source A* library', 'R package pcalg', 'Python package causaldag') but does not state that the authors are releasing their own code for the methodology described in this paper. |
| Open Datasets | No | The paper states that data was 'generated' ('generated 30 random linear Gaussian networks', 'use the R package by (Hyttinen, Eberhardt, and Järvisalo 2014) to generate random DAGs') and refers to data used in Figures 1 and 2, but it does not provide access information (link, DOI, citation) to any publicly available dataset. |
| Dataset Splits | Yes | Then we run A* and GES on n data points generated by the causal model, where n = 50, 100, 200, 500, 1k, 2k, 5k, 10k, 20k, 50k, 100k, 200k. The performance metrics, precision, recall, and F1 score are averaged over 30 networks for each sample size n. ... We run GES and A* on λ = 0.5, 1, 2, where λ is the penalization coefficient in the BIC score function Equation 2. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific software packages like 'R package pcalg', 'R package MXM', and 'Python package causaldag', but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The curves are similar in shape for these λ values, so we only show λ = 1 in the plots for better readability. We run GES and A* on λ = 0.5, 1, 2, where λ is the penalization coefficient in the BIC score function Equation 2. ... We find α = 0.0001 optimal for GSP on networks of 60 variables at sample size 500. For SAT and SP, we use α ∈ [0.2, 0.4] to get optimal performance. For PC, we try α = 0.01, 0.05, 0.1, and α = 0.05 gives the highest F1 score. |
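The λ-penalized BIC referenced above (Equation 2 in the paper) scores each node given its parent set in a linear Gaussian network. The following is a minimal illustrative sketch of such a score, not the authors' released code; the function name `node_bic` and its signature are assumptions for this example.

```python
import numpy as np

def node_bic(data, child, parents, lam=1.0):
    """Penalized BIC for one node given its candidate parent set.

    Assumes linear Gaussian data; `lam` plays the role of the
    penalization coefficient lambda in the paper's Equation 2.
    This is an illustrative sketch, not the authors' implementation.
    """
    n = data.shape[0]
    y = data[:, child]
    # Design matrix: intercept plus parent columns.
    if parents:
        X = np.column_stack([np.ones(n), data[:, parents]])
    else:
        X = np.ones((n, 1))
    # Ordinary least squares fit of the child on its parents.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n  # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1  # regression coefficients + noise variance
    # Penalized BIC: higher is better; lam scales the complexity term.
    return loglik - lam * 0.5 * k * np.log(n)
```

A score-based search (exhaustive or A*-guided) would sum such node scores over a DAG's families and maximize the total; larger λ penalizes extra edges more heavily, as in the λ = 0.5, 1, 2 settings quoted above.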