Learning Bayesian Networks with Thousands of Variables
Authors: Mauro Scanagatta, Cassio P. de Campos, Giorgio Corani, Marco Zaffalon
NeurIPS 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our approach on data sets containing up to ten thousand variables. As a performance indicator we consider the score of the network found. Our parent set identification approach outperforms consistently the usual approach of setting the maximum in-degree and then computing the score of all parent sets. Our structure optimization approach outperforms Gobnilp when learning with more than 500 nodes. All the software and data sets used in the experiments are available online. |
| Researcher Affiliation | Academia | Mauro Scanagatta, IDSIA, SUPSI, USI, Lugano, Switzerland, mauro@idsia.ch; Cassio P. de Campos, Queen's University Belfast, Northern Ireland, UK, c.decampos@qub.ac.uk; Giorgio Corani, IDSIA, SUPSI, USI, Lugano, Switzerland, giorgio@idsia.ch; Marco Zaffalon, IDSIA, Lugano, Switzerland, zaffalon@idsia.ch |
| Pseudocode | Yes | 1. Build and keep a Boolean square matrix m to mark which are the descendants of nodes (m(X, Y) tells whether Y is a descendant of X). Start it all false. 2. For each node Vj in the order, with j = n, ..., 1: |
| Open Source Code | Yes | All the software and data sets used in the experiments are available online. |
| Open Datasets | Yes | We consider 16 data sets already used in the literature of structure learning, first introduced in [13] and [8]. We take the largest networks available in the literature: andes (n=223), diabetes (n=413), pigs (n=441), link (n=724), munin (n=1041). Additionally, we randomly generate 15 further networks: five networks of size 2000, five networks of size 4000, five networks of size 10000. Each variable has a number of states randomly drawn from 2 to 4 and a number of parents randomly drawn from 0 to 6. Overall, we consider 20 networks. From each network we sample a data set of 5000 instances. |
| Dataset Splits | No | We randomly split each data set into three subsets of instances. (No percentages or counts are given, so the exact splits cannot be reproduced.) |
| Hardware Specification | No | For a given data set the computation is performed on the same machine. (No specific hardware details are provided.) |
| Software Dependencies | No | The largest data set analyzed in [1] with the Gobnilp software contains 413 variables. In [5], Gobnilp is used for structural learning with 1614 variables. (No version numbers are provided for any software.) |
| Experiment Setup | Yes | We allow one minute per variable to each approach for parent set identification. We set the maximum in-degree to k = 6, a high value that allows learning even complex structures. |
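
The Pseudocode row above describes a Boolean matrix m in which m(X, Y) records whether Y is a descendant of X. The quoted excerpt stops after the start of step 2, so the following Python is only a minimal sketch of one generic way to maintain such a matrix and use it to reject arcs that would close a cycle; it is not necessarily the authors' update rule. The function names (`new_descendant_matrix`, `creates_cycle`, `add_arc`) and the use of integer node indices are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[bool]]


def new_descendant_matrix(n: int) -> Matrix:
    """Return an n x n matrix, all False: no node has descendants yet."""
    return [[False] * n for _ in range(n)]


def creates_cycle(m: Matrix, parent: int, child: int) -> bool:
    """An arc parent -> child closes a cycle iff parent already descends from child."""
    return parent == child or m[child][parent]


def add_arc(m: Matrix, parent: int, child: int) -> None:
    """Record the arc parent -> child and propagate descendant information.

    Hypothetical helper: parent and all of its ancestors gain child
    and all of child's descendants.
    """
    if creates_cycle(m, parent, child):
        raise ValueError("arc would introduce a directed cycle")
    n = len(m)
    gained = [child] + [y for y in range(n) if m[child][y]]
    for x in range(n):
        if x == parent or m[x][parent]:
            for y in gained:
                m[x][y] = True


m = new_descendant_matrix(4)
add_arc(m, 0, 1)  # 0 -> 1
add_arc(m, 1, 2)  # 1 -> 2, so 2 also becomes a descendant of 0
```

With the matrix in place, checking whether a candidate arc is admissible is a single lookup, while each accepted arc costs at most a quadratic number of updates in the number of nodes.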