Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Towards Scalable Bayesian Learning of Causal DAGs
Authors: Jussi Viinikka, Antti Hyttinen, Johan Pensar, Mikko Koivisto
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments demonstrate the performance of our methods in detecting ancestor-descendant relations, and in causal effect estimation our Bayesian method is shown to outperform previous approaches. |
| Researcher Affiliation | Academia | Jussi Viinikka Department of Computer Science University of Helsinki EMAIL Antti Hyttinen HIIT & Department of Computer Science University of Helsinki EMAIL Johan Pensar Department of Mathematics University of Oslo EMAIL Mikko Koivisto Department of Computer Science University of Helsinki EMAIL |
| Pseudocode | Yes | Algorithm 1 The Gadget method for sampling DAGs; Algorithm 2 The Beeps method for sampling from the posterior of linear causal effects. |
| Open Source Code | Yes | We provide a Python interface for both algorithms, with many time critical parts implemented in C++. For source code see https://www.cs.helsinki.fi/group/sop/gadget-beeps. |
| Open Datasets | Yes | We also ran the algorithms on 8 data sets obtained from the UCI machine learning repository [6], with up to 23 variables, using available preprocessed sets [25]... Finally, we obtained 50 datasets with 100–1600 data points from a benchmark Gaussian BN on gene expressions of Arabidopsis thaliana with n = 107 nodes [34, 30]. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test dataset splits. It mentions generating or obtaining data but no specific split percentages or counts. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'Python interface' and 'C++', and refers to 'standard software [33, 13, 12, 1]' (e.g., 'bnlearn R package'), but does not provide specific version numbers for any software components. |
| Experiment Setup | Yes | We set n to 20 to enable exact evaluation of the achieved coverage and comparison to the best possible performance (Opt, cf. Prop. 2). We sampled two data sets of size N = 50 and N = 200 from each of 100 synthetic linear Gaussian DAGs, generated so that the expected neighborhood size was 4, the edge coefficients and the variances of the disturbances uniformly distributed on [0.1, 2] and [0.5, 2], respectively... We ran M = 16 shorter heated chains in parallel... selecting K = 6, 9, 12 candidate parents. |
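
The synthetic-data setup quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' generator: it assumes each possible edge is included independently with probability 4 / (n - 1) so that the expected neighborhood size is 4, assumes positive edge coefficients, and uses the hypothetical function names `sample_linear_gaussian_dag` and `sample_data`.

```python
import random


def sample_linear_gaussian_dag(n=20, expected_neighbors=4.0, seed=None):
    """Draw a random linear Gaussian DAG (illustrative assumptions only)."""
    rng = random.Random(seed)
    # Assumption: independent edge inclusion, so each node has
    # (n - 1) * p = expected_neighbors neighbors on average.
    p = expected_neighbors / (n - 1)
    # A random node ordering; edges only point forward in it, so the graph is acyclic.
    order = list(range(n))
    rng.shuffle(order)
    # parents[j] maps each parent i of node j to the coefficient of edge i -> j.
    parents = {j: {} for j in range(n)}
    for a in range(n):
        for b in range(a + 1, n):
            if rng.random() < p:
                # Edge coefficients uniform on [0.1, 2] (sign assumed positive).
                parents[order[b]][order[a]] = rng.uniform(0.1, 2.0)
    # Disturbance variances uniform on [0.5, 2].
    variances = [rng.uniform(0.5, 2.0) for _ in range(n)]
    return order, parents, variances


def sample_data(order, parents, variances, num_points, seed=None):
    """Sample data points by ancestral sampling along the node ordering."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_points):
        x = [0.0] * len(order)
        for j in order:  # all parents of j were assigned earlier in the ordering
            mean = sum(w * x[i] for i, w in parents[j].items())
            x[j] = mean + rng.gauss(0.0, variances[j] ** 0.5)
        rows.append(x)
    return rows


if __name__ == "__main__":
    order, parents, variances = sample_linear_gaussian_dag(n=20, seed=0)
    small = sample_data(order, parents, variances, num_points=50, seed=1)
    large = sample_data(order, parents, variances, num_points=200, seed=2)
    print(len(small), len(large), sum(len(p) for p in parents.values()))
```

Calling `sample_data` twice on the same DAG, with 50 and 200 points, mirrors the two data set sizes N = 50 and N = 200 mentioned in the row; repeating the whole procedure for 100 DAGs would follow the described protocol under the same assumptions.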